Zero-Shot Learning in Single-Cell Foundation Models: A Realistic Assessment of Capabilities and Limitations for Biomedical Research

Logan Murphy Nov 27, 2025

Abstract

This article provides a comprehensive analysis of the zero-shot capabilities of single-cell foundation models (scFMs), which are large-scale AI models pre-trained on millions of single-cell transcriptomes. Aimed at researchers, scientists, and drug development professionals, it explores the foundational concepts of scFMs, their practical applications in tasks like cell type annotation and batch integration, and rigorous benchmarking that reveals their current performance gaps compared to simpler methods. Synthesizing the latest 2025 research, the article also covers strategies for optimizing model utility, introduces novel biology-driven evaluation metrics, and discusses the future trajectory of these tools in advancing drug discovery and clinical applications.

Understanding Single-Cell Foundation Models and the Critical Role of Zero-Shot Evaluation

What Are Single-Cell Foundation Models? Defining the AI Paradigm for Cell Biology

Single-cell foundation models (scFMs) represent a transformative paradigm in computational biology, leveraging large-scale deep learning architectures pre-trained on massive single-cell datasets to enable a wide range of downstream analytical tasks. These models are built on the premise that by exposing an artificial intelligence system to millions of single-cell profiles encompassing diverse tissues, species, and biological conditions, the model can learn fundamental principles of cellular biology that generalize to new datasets and applications [1] [2]. Inspired by the revolutionary success of foundation models in natural language processing and computer vision, researchers have adapted these approaches to decipher the "language of cells," where individual cells are treated analogously to sentences, and genes or genomic features serve as words or tokens [1].

The significance of scFMs lies in their potential to overcome critical challenges in single-cell biology, including the high dimensionality, sparsity, and technical noise inherent in single-cell sequencing data [2]. By capturing universal patterns across vast collections of single-cell measurements, these models aim to provide a unified framework for analyzing cellular heterogeneity, regulatory networks, and biological systems at unprecedented scale and resolution. The emergence of public data archives containing tens of millions of single-cell omics datasets has created the fertile ground needed for training these sophisticated models, enabling researchers to move from targeted analyses of individual experiments to generalized computational approaches that leverage aggregated biological knowledge [1].

Architectural Framework and Core Components

Model Architecture and Tokenization Strategies

Most single-cell foundation models are built on transformer architectures, which have demonstrated remarkable success in capturing complex relationships in sequential data. The adaptation of transformers to single-cell data requires innovative solutions to address the non-sequential nature of biological measurements. Unlike words in a sentence, genes in a cell have no inherent ordering, necessitating specialized tokenization approaches that convert gene expression profiles into structured input sequences [1].

Common tokenization strategies include ranking genes within each cell by expression levels, creating a deterministic sequence based on expression magnitude. Alternative approaches partition genes into expression value bins or use normalized counts directly [1]. The tokenization process typically generates three core components: gene embeddings (representing gene identity), value embeddings (capturing expression levels), and positional embeddings (providing sequence context). Some models incorporate special tokens for cell identity, experimental metadata, or modality indicators when handling multi-omics data [2]. These embeddings are processed through multiple transformer layers with self-attention mechanisms that learn to weight relationships between gene tokens, effectively capturing co-expression patterns and regulatory relationships [1].
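
To make the embedding assembly concrete, here is a toy NumPy sketch; every table, dimension, and ID below is a made-up stand-in for learned parameters rather than any specific model's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_bins, d_model, seq_len = 20_000, 10, 64, 8

# Hypothetical lookup tables; in a real model these are learned parameters.
gene_table = rng.normal(size=(n_genes, d_model))   # gene identity embeddings
value_table = rng.normal(size=(n_bins, d_model))   # one row per expression bin
pos_table = rng.normal(size=(seq_len, d_model))    # positional embeddings

gene_ids = np.array([5, 17, 120, 4031, 998, 2, 77, 15000])  # tokenized genes
value_bins = np.array([9, 9, 8, 7, 5, 5, 3, 1])             # binned expression

# Combine the three embedding types (here by summation) into the input.
x = gene_table[gene_ids] + value_table[value_bins] + pos_table[np.arange(seq_len)]
print(x.shape)  # (8, 64): one d_model-dimensional token per selected gene
```

Summation is the combination used by BERT-style models; concatenation followed by a linear projection is an alternative design choice some architectures adopt.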

Table 1: Architectural Variations in Single-Cell Foundation Models

| Model Type | Architecture | Tokenization Approach | Primary Application |
| --- | --- | --- | --- |
| Encoder-based (BERT-like) | Bidirectional attention | Gene ranking or binning | Cell classification, embedding generation |
| Decoder-based (GPT-like) | Unidirectional masked attention | Expression-based sequencing | Gene expression prediction, generation |
| Hybrid Designs | Encoder-decoder combinations | Multi-modal integration | Cross-modal translation, complex inference |

Pretraining Objectives and Knowledge Capture

ScFMs are typically pretrained using self-supervised learning objectives that don't require manually labeled data. The most common approach is masked language modeling, where the model is trained to predict the expression of randomly masked genes given the context of other genes in the cell [3]. This training paradigm encourages the model to learn biological relationships between genes, such as co-regulation within pathways or functional modules. The underlying hypothesis is that successfully predicting masked gene expressions requires understanding the complex dependencies and interactions within cellular systems [1] [2].

During pretraining, models develop rich internal representations at both the gene and cell levels. Gene embeddings capture functional similarities, while cell embeddings encode cellular states and types [2]. The attention mechanisms in transformer layers potentially learn to identify key regulatory relationships and biological pathways. However, recent evaluations have raised questions about the depth of biological knowledge actually captured during pretraining, as models sometimes fail to outperform simpler methods on fundamental tasks [4] [3].

Critical Evaluation of Zero-Shot Capabilities

Performance Benchmarks in Zero-Shot Settings

Zero-shot evaluation, where models are applied to downstream tasks without any task-specific training, represents the most rigorous test of a foundation model's generalization capabilities and biological understanding. This assessment approach is particularly critical for discovery settings where labels are unknown or task-specific training is impractical [4]. Recent comprehensive evaluations of popular scFMs like Geneformer and scGPT have revealed significant limitations in their zero-shot performance across fundamental analytical tasks.

In cell type clustering, both Geneformer and scGPT underperform established methods such as scVI and Harmony, as well as simple approaches like selecting highly variable genes (HVG). Quantitative assessments using metrics like average BIO score demonstrate that these foundation models struggle to separate known cell types across multiple datasets, with performance inconsistencies that aren't fully explained by overlap between evaluation and pretraining datasets [4]. Similarly, in batch integration tasks, which aim to remove technical artifacts while preserving biological variation, scFMs show limited effectiveness. Geneformer's embeddings often fail to retain cell type information, with clustering primarily driven by batch effects rather than biological signals [4].

Table 2: Zero-Shot Performance Comparison Across Single-Cell Analytical Tasks

| Method | Cell Type Clustering (AvgBIO Score) | Batch Integration (iLISI Score) | Gene Expression Prediction (Pearson Correlation) |
| --- | --- | --- | --- |
| scGPT | 0.45-0.62 | 0.51-0.65 | 0.08-0.22 (without cell embedding) |
| Geneformer | 0.38-0.55 | 0.42-0.58 | Not comprehensively evaluated |
| scVI | 0.58-0.71 | 0.63-0.75 | N/A |
| Harmony | 0.54-0.69 | 0.59-0.72 | N/A |
| HVG Selection | 0.61-0.73 | 0.67-0.78 | N/A |

Biological Relevance and Representation Learning

Beyond quantitative metrics, researchers have developed novel approaches to assess the biological relevance of representations learned by scFMs. The scGraph-OntoRWR metric measures the consistency between cell type relationships captured by model embeddings and established biological knowledge from cell ontologies [2]. Similarly, gene embeddings can be evaluated by their ability to predict functional relationships, tissue specificity, and Gene Ontology terms [2].

These analyses reveal that while scFMs capture some biological structure, their representations don't consistently outperform simpler alternatives or directly align with known biological hierarchies. The discrepancy between the promising conceptual framework of scFMs and their practical performance limitations suggests several potential issues: the masked language modeling objective may not optimally transfer to downstream tasks, models may require different architectural approaches to effectively capture biological complexity, or current training datasets may lack the diversity or quality needed for robust generalization [4] [3].

Experimental Protocols for scFM Evaluation

Protocol 1: Zero-Shot Cell Type Annotation

Purpose: To evaluate the capability of scFMs to generate cell embeddings that separate cell types without task-specific training, simulating discovery settings where cell type labels are unknown.

Materials:

  • Single-cell RNA sequencing dataset with ground truth cell type labels
  • Pretrained scFM (e.g., scGPT, Geneformer, UCE, scFoundation)
  • Comparison methods (scVI, Harmony, HVG selection)
  • Computing environment with adequate GPU resources

Procedure:

  • Data Preprocessing:
    • Load target dataset and apply standard quality control filters
    • Normalize gene expression values if required by the specific foundation model
    • Note: Do not perform model fine-tuning or task-specific training
  • Embedding Generation:

    • Extract cell embeddings from the foundation model using its zero-shot capabilities
    • For scGPT: use the cell embedding from the special [CLS] token
    • For Geneformer: extract the cell representation from the final layer
  • Dimensionality Reduction and Clustering:

    • Apply UMAP or t-SNE to embeddings for visualization
    • Perform Leiden clustering on the embedding space
    • Compare cluster identities with ground truth cell type labels
  • Quantitative Assessment:

    • Calculate ARI (Adjusted Rand Index) between clusters and true labels
    • Compute NMI (Normalized Mutual Information) to evaluate cluster purity
    • Determine ASW (Average Silhouette Width) for cell type separation quality
    • Apply ontology-informed metrics (LCAD) to assess biological relevance of errors

Interpretation: High ARI and NMI scores indicate strong zero-shot clustering performance. Comparison with baseline methods reveals whether the foundation model provides advantages over established approaches. The LCAD metric helps determine if misclassifications are biologically reasonable (closely related cell types) or severe (distantly related types) [2].
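
The quantitative assessment step can be sketched with scikit-learn on synthetic embeddings; KMeans stands in for Leiden (which requires a graph library), and the three well-separated "cell types" below are fabricated purely to exercise the metrics:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score, silhouette_score)

rng = np.random.default_rng(1)
# Stand-in for zero-shot cell embeddings: three synthetic "cell types".
emb = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 16))
                 for c in (0.0, 2.0, 4.0)])
truth = np.repeat([0, 1, 2], 50)

# Cluster the embedding space, then compare clusters to ground truth.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)

ari = adjusted_rand_score(truth, clusters)          # cluster/label agreement
nmi = normalized_mutual_info_score(truth, clusters) # cluster purity
asw = silhouette_score(emb, truth)                  # cell type separation
print(f"ARI={ari:.2f} NMI={nmi:.2f} ASW={asw:.2f}")
```

On real data, `emb` would be the matrix of foundation-model cell embeddings and `truth` the annotated cell type labels; the LCAD ontology metric has no standard scikit-learn implementation and is omitted here.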

Protocol 2: Batch Integration Assessment

Purpose: To evaluate the ability of scFMs to remove technical batch effects while preserving biological variation in zero-shot settings.

Materials:

  • Single-cell dataset with known batch effects and biological groups
  • Pretrained scFM and baseline integration methods
  • Metrics for batch mixing and biological conservation

Procedure:

  • Experimental Setup:
    • Select dataset with significant technical variation (different platforms, protocols, or laboratories)
    • Ensure the dataset contains known biological groups for conservation assessment
  • Embedding Generation and Integration:

    • Generate zero-shot cell embeddings using the foundation model
    • Compare against dedicated batch correction methods (Harmony, scVI)
    • Include simple baselines (HVG selection) for reference
  • Dual-Metric Evaluation:

    • Calculate batch mixing scores (iLISI, PCR) to quantify technical effect removal
    • Compute biological conservation scores (cLISI, ASW) to assess preservation of real variation
    • Visualize integration results using UMAP, coloring by batch and cell type
  • Comparative Analysis:

    • Rank methods by their ability to simultaneously minimize batch effects and preserve biology
    • Assess dataset-specific performance patterns across tissue types and technologies

Interpretation: Effective batch integration should achieve high batch mixing scores while maintaining high biological conservation. The critical assessment is whether foundation models provide advantages over specialized integration methods, particularly for complex batch effects involving both technical and biological sources of variation [4] [2].

Visualization of Model Architectures and Evaluation Workflows

[Figure: Single-cell foundation model architecture. A single-cell expression matrix is tokenized (gene ranking or binning) into gene, expression-value, and positional embeddings, combined with special tokens (cell ID, batch, modality) into input embeddings. These pass through transformer layers (multi-head self-attention, layer normalization, feed-forward network) to produce context-aware representations, from which the [CLS] cell embedding, gene representations, and a masked-gene prediction head are derived. Zero-shot applications include cell type annotation, batch effect integration, and biological relationship mining.]

Single-Cell Foundation Model Architecture

[Figure: Zero-shot evaluation workflow. An scRNA-seq dataset with ground-truth labels is passed to both the foundation model (zero-shot) and baseline methods (scVI, Harmony, HVG). Each is evaluated on cell type clustering, batch effect integration, and gene expression prediction using traditional metrics (ARI, NMI, ASW) and biological metrics (ontology alignment), yielding a comparative performance ranking and a method recommendation based on task requirements.]

Zero-Shot Evaluation Workflow

Essential Research Reagent Solutions

Table 3: Key Computational Tools for Single-Cell Foundation Model Research

| Tool Category | Representative Solutions | Primary Function | Application Context |
| --- | --- | --- | --- |
| Foundation Models | scGPT, Geneformer, UCE, scFoundation, LangCell | Large-scale pretrained models for single-cell data | Zero-shot inference, transfer learning, biological discovery |
| Baseline Methods | scVI, Harmony, Seurat, SC3 | Established single-cell analysis pipelines | Performance benchmarking, method comparison |
| Evaluation Metrics | ARI, NMI, ASW, LISI, scGraph-OntoRWR | Quantitative performance assessment | Model validation, biological relevance quantification |
| Data Resources | CZ CELLxGENE, Human Cell Atlas, PanglaoDB | Curated single-cell datasets | Model pretraining, benchmarking, transfer evaluation |
| Visualization Tools | SCope, UCSC Cell Browser | Interactive data exploration | Result interpretation, quality assessment, publication graphics |

Single-cell foundation models represent an ambitious paradigm shift in computational biology, aiming to create universal models that capture fundamental principles of cellular biology. While their conceptual framework is promising, current evaluations reveal significant limitations in zero-shot settings, where these models often underperform simpler, specialized methods [4] [3]. The discrepancy between the theoretical potential and practical performance highlights the need for continued research into model architectures, pretraining objectives, and evaluation methodologies.

Future advancements in scFMs will likely focus on several critical areas: developing more biologically meaningful pretraining objectives that better transfer to downstream tasks, incorporating multi-modal data to create more comprehensive cellular representations, improving model interpretability to extract actionable biological insights, and establishing rigorous standardized benchmarks that assess true biological understanding rather than just analytical performance [1] [2]. As these models continue to evolve, they hold the potential to transform our approach to cellular biology, enabling discoveries that bridge molecular mechanisms, cellular functions, and physiological systems through integrated AI-driven analysis.

The advent of single-cell RNA sequencing (scRNA-seq) has unveiled unprecedented resolution for exploring cellular heterogeneity. Concurrently, the transformer architecture, which has revolutionized natural language processing (NLP), is now being repurposed to interpret the "language of biology" encoded in gene expression data [1]. This convergence has given rise to single-cell foundation models (scFMs), large-scale deep learning models pretrained on vast atlases of single-cell data [1] [5]. A critical, yet underexplored, capability of these models is zero-shot learning, where the model makes predictions on novel tasks or datasets without any task-specific fine-tuning [4]. This is paramount in biological discovery settings where cell type compositions or states are unknown a priori [4] [6].

The performance of these models in a zero-shot setting hinges on two core architectural components: the tokenization process, which converts raw, non-sequential gene expression data into a structured sequence of discrete units, and the transformer model itself, which processes these tokens to learn complex, generalizable representations of cellular state [1]. This application note details the methodologies for these core components, framed within the context of zero-shot learning research, to provide researchers with the protocols needed to understand, evaluate, and apply these cutting-edge tools.

Tokenization: Converting Gene Expression to Model Input

Tokenization is the foundational step that standardizes raw, continuous, and non-sequential gene expression data into a structured format that transformer models can process. Unlike words in a sentence, genes in a cell have no inherent order, making the tokenization strategy for scRNA-seq data a critical design choice [1].

Tokenization Strategies and Protocols

The following protocols describe the primary methods for tokenizing gene expression data. The choice of method can significantly impact model performance and biological interpretability.

  • Protocol 2.1.1: Tokenization by Gene Expression Ranking

    • Objective: To create a deterministic input sequence by ranking genes based on their expression magnitude, providing a consistent order for the transformer.
    • Materials: A cell-by-gene count matrix (e.g., from Cell Ranger or Scanpy), computational environment (e.g., Python, R).
    • Method Details:
      • Input: For a single cell, start with a vector of normalized gene expression counts for all G genes.
      • Ranking: Sort the genes in descending order based on their expression values.
      • Selection: Select the top K genes (where K is a predefined sequence length, e.g., 1200) to form the input sequence.
      • Token Generation: Each gene in the ranked list is treated as a token. The token incorporates an embedding of the gene's identifier (e.g., its Ensembl ID) [1] [7].
      • Positional Encoding: Apply standard transformer positional encodings based on the gene's rank in the sequence (1st, 2nd, ..., Kth).
  • Protocol 2.1.2: Tokenization by Expression Value Binning

    • Objective: To incorporate quantitative expression levels directly into the tokenization scheme by binning expression values.
    • Materials: A cell-by-gene count matrix, normalized and log1p-transformed data.
    • Method Details:
      • Input: Normalized gene expression data for a single cell.
      • Binning: Discretize continuous expression values into N quantile bins (e.g., deciles, Q1 through Q10). This transforms a continuous value into a categorical token representing its expression level relative to the population [7].
      • Token Generation: Each gene is represented by a combination of its identity and its expression bin token. For instance, a token could be "CD4_Q5" representing the CD4 gene with a median expression level.
      • Model Insight: Models like ETHOS have demonstrated that this approach allows the transformer to learn the sequential relationship between quantile tokens, with embeddings for high quantiles (e.g., Q9, Q10) showing greater separation, potentially reflecting their heightened clinical significance [7].
  • Protocol 2.1.3: Integration of Special and Metadata Tokens

    • Objective: To enrich the model's context by providing information beyond gene expression, such as cell-level metadata or experimental batch.
    • Materials: Gene expression matrix accompanied by relevant metadata (e.g., donor ID, sequencing batch, tissue of origin).
    • Method Details:
      • Special Tokens: Prepend or append special tokens to the gene sequence. A common example is a [CELL] token, whose final embedding is used as a summary representation for the entire cell [1] [5] [8].
      • Batch Tokenization: Incorporate a token representing the batch or study of origin to help the model learn and potentially correct for technical artifacts [5].
      • Multi-modal Tokenization: For integrated models, include modality-specific tokens (e.g., [ATAC] or [PROTEIN]) to process multi-omics data within a single framework [1].
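
The ranking and binning protocols above can be sketched in a few lines of NumPy; `rank_tokenize` and `bin_tokenize` are hypothetical helper names, and the bin-edge handling is one plausible reading of the protocols rather than any model's exact scheme:

```python
import numpy as np

def rank_tokenize(expr, k=1200):
    """Protocol 2.1.1 sketch: order genes by descending expression and keep
    the indices of the top-k expressed genes as the token sequence."""
    order = np.argsort(expr)[::-1]          # gene indices, highest expression first
    expressed = order[expr[order] > 0]      # drop unexpressed genes
    return expressed[:k]

def bin_tokenize(expr, n_bins=10):
    """Protocol 2.1.2 sketch: map each expressed gene's log1p value to a
    per-cell quantile bin in 1..n_bins, yielding (gene_index, bin) pairs
    analogous to tokens like "CD4_Q5"."""
    expressed = np.flatnonzero(expr > 0)
    vals = np.log1p(expr[expressed])
    # interior quantile edges; searchsorted assigns each value its bin
    edges = np.quantile(vals, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, vals, side="right") + 1
    return list(zip(expressed.tolist(), bins.tolist()))

rng = np.random.default_rng(0)
cell = rng.poisson(0.3, size=20_000).astype(float)  # sparse toy count vector

tokens = rank_tokenize(cell)
print(tokens.shape)  # (1200,): ranked gene indices

small = np.array([0.0, 5.0, 1.0, 0.0, 12.0, 2.0, 2.0, 0.0])
print(bin_tokenize(small, n_bins=4))
```

In a full pipeline these token indices would then be looked up in learned embedding tables and prepended with special tokens such as [CELL] before entering the transformer.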

The diagram below illustrates the logical workflow for processing raw single-cell data into a tokenized sequence ready for transformer input.

[Figure: Tokenization workflow. Raw scRNA-seq data (cell-by-gene matrix) is normalized (e.g., log1p), passed through a tokenization strategy (gene ranking or value binning), augmented with special tokens such as [CELL] and [BATCH], and emitted as the tokenized input sequence.]

Comparative Analysis of Tokenization Approaches

Table 1: Comparison of primary tokenization strategies for single-cell gene expression data.

| Tokenization Strategy | Key Principle | Advantages | Limitations | Representative Models |
| --- | --- | --- | --- | --- |
| Gene Expression Ranking | Orders genes by expression level to create a sequence. | Provides a deterministic input order; simple to implement. | The arbitrary sequence may not reflect biological gene-gene relationships. | Geneformer [1] [4] |
| Expression Value Binning | Discretizes continuous expression into quantile bins. | Encodes quantitative expression levels directly into tokens. | May lose fine-grained, continuous information. | ETHOS [7] |
| Identity-Only | Uses gene identities with normalized counts, minimal structuring. | Simple; reports suggest complex ranking may offer no clear advantage [8]. | May require more data or model capacity to learn expression patterns. | scGPT (option) [1] [8] |

Transformer Architecture and Zero-Shot Workflow

The transformer architecture processes the tokenized sequences to build a contextualized understanding of cellular state. The model's pretraining objective is designed to instill this general knowledge, which is then directly accessed in a zero-shot manner.

Model Architecture and Pretraining Protocol

  • Protocol 3.1.1: Model Pretraining with Masked Language Modeling
    • Objective: To train a transformer model on a large, unlabeled corpus of single-cell data so it learns fundamental biological principles and gene-gene relationships.
    • Materials: A large-scale collection of single-cell datasets (e.g., from CZ CELLxGENE, Human Cell Atlas), high-performance computing resources with multiple GPUs.
    • Method Details:
      • Architecture Selection:
        • Encoder-based (BERT-like): Uses bidirectional attention, meaning all tokens in a sequence attend to all other tokens simultaneously. This is effective for tasks that require a comprehensive understanding of the entire cell state, such as cell type classification [1] [4]. Models: scBERT.
        • Decoder-based (GPT-like): Uses causal (unidirectional) attention, where a token can only attend to previous tokens in the sequence. This is often used for generative tasks, such as predicting masked genes or simulating future cell states [1] [5]. Models: scGPT.
      • Pretraining Task - Masked Language Modeling (MLM): Randomly mask a portion (e.g., 15-20%) of the gene tokens in the input sequence. The model is then trained to predict the identity (and sometimes the expression value) of the masked genes based on the context provided by the unmasked genes [1] [5]. This forces the model to learn the complex, co-varying relationships between genes.
      • Output - Cell Embedding: The activation state of the special [CELL] token (or the average of all output token embeddings) at the final layer is used as a fixed-dimensional vector representation (embedding) that summarizes the entire cell's state [1] [4].
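
The masking step of the MLM objective can be sketched as follows; `MASK_ID` and the 15% fraction are illustrative choices, and a real implementation would feed `corrupted` through the model and compute a loss against `targets`:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, mask_frac, MASK_ID = 1200, 0.15, -1

tokens = rng.integers(0, 20_000, size=seq_len)   # toy gene-token sequence

# Randomly choose ~15% of positions to mask, as in the MLM objective.
n_mask = int(seq_len * mask_frac)
mask_pos = rng.choice(seq_len, size=n_mask, replace=False)

corrupted = tokens.copy()
corrupted[mask_pos] = MASK_ID    # model sees these positions as [MASK]
targets = tokens[mask_pos]       # model is trained to recover these

print(n_mask, np.sum(corrupted == MASK_ID))
```

Some models predict the masked gene identity, others the masked expression value; both variants use this same corrupt-and-recover structure.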

Zero-Shot Inference and Evaluation Protocol

  • Protocol 3.2.1: Performing Zero-Shot Cell Type Clustering
    • Objective: To use the pretrained model's cell embeddings to cluster cells into types without any further training or fine-tuning on the target dataset.
    • Materials: A pretrained scFM (e.g., scGPT, Geneformer), a new target scRNA-seq dataset (processed and tokenized), clustering algorithms (e.g., Leiden, K-means).
    • Method Details:
      • Inference: Pass the tokenized target dataset through the pretrained model.
      • Embedding Extraction: For each cell, extract the cell embedding vector from the model's output.
      • Dimensionality Reduction: Apply techniques like UMAP or t-SNE to the matrix of cell embeddings for visualization.
      • Clustering: Apply a clustering algorithm to the cell embeddings to identify groups of transcriptionally similar cells.
      • Evaluation: Compare the model's clusters to known cell type labels using metrics like Average Silhouette Width (ASW) or Adjusted Rand Index (ARI). Compare performance against established baselines like highly variable genes (HVG) coupled with scVI or Harmony [4].
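
For the baseline comparison in the evaluation step, the HVG-plus-PCA reference embedding can be sketched as below; the function name, gene counts, and the variance-based HVG criterion are illustrative simplifications (Scanpy's HVG selection uses dispersion-normalized statistics):

```python
import numpy as np
from sklearn.decomposition import PCA

def hvg_pca_baseline(X, n_hvg=2000, n_pcs=50):
    """HVG + PCA baseline sketch: select the most variable genes from a
    log-normalized cell-by-gene matrix, then reduce with PCA. This is the
    simple baseline that zero-shot scFM embeddings are benchmarked against."""
    variances = X.var(axis=0)
    hvg_idx = np.argsort(variances)[::-1][:n_hvg]   # top-variance genes
    n_pcs = min(n_pcs, n_hvg, X.shape[0])
    return PCA(n_components=n_pcs).fit_transform(X[:, hvg_idx])

rng = np.random.default_rng(0)
X = rng.gamma(2.0, 1.0, size=(300, 5000))   # toy log-normalized matrix
emb = hvg_pca_baseline(X, n_hvg=1000, n_pcs=30)
print(emb.shape)  # (300, 30)
```

The resulting baseline embedding is clustered and scored with the same ARI/ASW metrics as the foundation-model embedding, making the comparison direct.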

The following diagram outlines the complete workflow from pretraining to zero-shot evaluation.

[Figure: Pretraining-to-zero-shot workflow. A large-scale scRNA-seq pretraining corpus is tokenized and used to train a transformer (encoder or decoder) via self-supervised masked language modeling, yielding a pretrained foundation model. A new, unlabeled target dataset is then tokenized and passed through this model for zero-shot inference, producing cell embeddings for downstream analysis (clustering, visualization).]

Performance of Zero-Shot Models

Recent rigorous evaluations of scFMs in zero-shot settings have revealed critical insights into their current capabilities and limitations.

Table 2: Zero-shot performance of single-cell foundation models on key tasks compared to baseline methods. Performance is summarized from Kedzierska et al. [4].

| Model / Baseline | Cell Type Clustering (AvgBIO Score) | Batch Integration (iLISI Score) | Key Findings and Limitations |
| --- | --- | --- | --- |
| HVG + PCA | Best | Best | A simple baseline of highly variable genes with PCA surprisingly outperformed foundation models on multiple datasets and metrics [4]. |
| scVI | Better | Better | A specialized deep learning model for scRNA-seq consistently showed strong performance in both clustering and batch integration [4]. |
| Harmony | Better | Better | A robust batch integration method performed well, particularly on technical batch effects [4]. |
| scGPT | Variable | Intermediate | Shows inconsistent performance; pretraining helps but does not consistently surpass simpler methods. Struggles with complex biological batch effects [4]. |
| Geneformer | Worse | Worse | Underperforms relative to all other methods and baselines in zero-shot evaluation; embeddings often dominated by batch effects [4]. |

The Scientist's Toolkit

Research Reagent Solutions

Table 3: Essential computational tools and resources for working with single-cell foundation models.

| Item | Function / Description | Example / Source |
| --- | --- | --- |
| Pretraining Data | Large, aggregated single-cell datasets used to train foundation models. Provides the "corpus" of cellular states. | CZ CELLxGENE [1] [4], Human Cell Atlas [1], PanglaoDB [1] |
| Model Architectures | The specific implementation of the transformer model (encoder or decoder). | scGPT (decoder) [1] [5], scBERT (encoder) [1] [4], Geneformer (encoder) [4] |
| Evaluation Benchmarks | Standardized datasets and metrics for fairly comparing model performance, especially zero-shot. | Pancreas dataset [4], Tabula Sapiens [4], Immune cell datasets [4] |
| Baseline Methods | Established, often simpler, computational methods that serve as a critical point of comparison. | Highly Variable Genes (HVG) [4], scVI [4], Harmony [4] |
| Visualization Tools | Software libraries for visualizing high-dimensional cell embeddings and model attention. | UMAP, t-SNE, Scanpy [9] |

The core architecture of transformers, fed by thoughtfully tokenized gene expression data, provides a powerful framework for building foundation models in single-cell biology. The protocols outlined here for tokenization, model pretraining, and zero-shot evaluation provide a roadmap for researchers to implement and critically assess these technologies. However, current evidence indicates that the promise of robust, out-of-the-box zero-shot inference has not yet been fully realized, with simpler methods often outperforming large, complex foundation models on tasks like cell type clustering and batch integration [4] [5]. This underscores the importance of rigorous zero-shot evaluation as a mandatory step in the development and application of scFMs. Future progress will likely depend on more biologically informed tokenization strategies [10], novel pretraining objectives that better capture hierarchical cellular relationships, and a continued focus on model interpretability and reliability for zero-shot tasks in exploratory research and drug development.

The advent of single-cell RNA sequencing (scRNA-seq) has revolutionized biology by enabling the profiling of gene expression at the resolution of individual cells, uncovering cellular heterogeneity with unprecedented precision [11] [12]. However, the analysis of scRNA-seq data is fraught with challenges stemming from its high dimensionality, technical noise, and sparsity [12] [13]. Foundation models pretrained on millions of single-cell transcriptomes have emerged as a powerful strategy to overcome these hurdles. These models aim to learn universal patterns of gene expression and cell states from large-scale data, creating a foundational knowledge that can be rapidly specialized for diverse downstream tasks with minimal additional training [4].

The significance of these models is particularly pronounced in the context of zero-shot learning, where the model's internal representation of input data is used for analysis without any task-specific fine-tuning [4]. This capability is critical for exploratory biological discovery, where predefined labels are unavailable and fine-tuning is therefore infeasible. This application note details the pretraining process, data requirements, model architectures, and evaluation protocols for building and validating single-cell foundation models, with a specific focus on their zero-shot capabilities.

Data Acquisition and Curation

The efficacy of a foundation model is fundamentally dependent on the scale and quality of its pretraining data. Assembling a massive, diverse, and well-curated corpus of single-cell data is the first and most critical step.

Large-scale single-cell datasets are aggregated from various public repositories, including:

  • National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO)
  • European Nucleotide Archive (ENA)
  • Genome Sequence Archive (GSA)
  • CellxGENE database [14] [15]

These datasets are stored in multiple formats (e.g., FASTQ, h5ad, Seurat objects), requiring standardized processing pipelines for consolidation [15].

Data Processing and Standardization

A uniform workflow is essential to convert raw data into a clean, analysis-ready gene expression matrix. Key steps include:

  • Quality Control: Filtering out low-quality cells and genes based on metrics like mitochondrial read percentage and gene counts.
  • Gene Name Standardization: Standardizing gene identifiers according to the HUGO Gene Nomenclature Committee (HGNC) guidelines.
  • Format Conversion: Converting all data into a unified sparse matrix format for efficient storage and computation [15].
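To make the three processing steps above concrete, here is a minimal sketch on a toy dense count matrix. The thresholds (`min_genes`, `max_mito_frac`) and the tiny alias map are hypothetical placeholders for illustration, not values from any published pipeline; real workflows would use Scanpy's preprocessing functions and full HGNC mappings.

```python
# Toy sketch of the QC and standardization steps (hypothetical thresholds).
# Rows = cells, columns = genes; a small dense matrix is used for clarity,
# whereas real pipelines operate on sparse matrices.

counts = [
    [5, 0, 3, 2],   # cell passing QC
    [0, 0, 1, 0],   # low-complexity cell (too few genes detected)
    [8, 1, 0, 40],  # cell dominated by mitochondrial reads
]
genes = ["CD8A", "ACTB", "TP53", "MT-CO1"]

def qc_filter(counts, genes, min_genes=2, max_mito_frac=0.5):
    """Keep cells with enough detected genes and a low mitochondrial fraction."""
    mito_idx = [i for i, g in enumerate(genes) if g.startswith("MT-")]
    kept = []
    for cell in counts:
        n_genes = sum(1 for c in cell if c > 0)
        total = sum(cell)
        mito_frac = sum(cell[i] for i in mito_idx) / total if total else 1.0
        if n_genes >= min_genes and mito_frac <= max_mito_frac:
            kept.append(cell)
    return kept

# HGNC-style symbol standardization via a (hypothetical) alias map.
ALIASES = {"ACTB1": "ACTB", "P53": "TP53"}
def standardize(symbol):
    return ALIASES.get(symbol, symbol)

filtered = qc_filter(counts, genes)
print(len(filtered))        # only the first cell survives QC
print(standardize("P53"))   # canonical HGNC symbol: TP53
```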

Table 1: Exemplary Large-Scale Pretraining Datasets for Single-Cell Foundation Models

| Model | Pretraining Dataset Scale | Data Composition | Primary Source |
|---|---|---|---|
| CellFM [15] | ~100 million human cells | 46.3M normal cells, 7.1M viral infection cells, 3.5M lung cancer cells; diverse cell types (T cells, neurons, etc.) | Public repositories (GEO, ENA, GSA) |
| scPRINT [14] | >50 million cells | Multiple species, diseases, and ethnicities | CellxGENE database |
| scGPT [4] | >33 million non-cancerous human cells | Includes blood, bone marrow, and kidney cells | CELLxGENE initiative |
| Geneformer [4] | 30 million single-cell transcriptomes | Diverse human tissues | Not specified |

Model Architectures and Pretraining Strategies

Single-cell foundation models adapt architectures from natural language processing, treating genes as words and a cell's expression profile as a sentence. The choice of architecture and how gene expression is "tokenized" are pivotal design decisions.

Tokenization Strategies for Gene Expression

A key challenge is converting continuous gene expression values into discrete tokens or embeddings suitable for model input. The field has converged on three primary strategies:

Table 2: Comparison of Gene Expression Tokenization Strategies

| Tokenization Strategy | Mechanism | Representative Models | Advantages | Limitations |
|---|---|---|---|---|
| Rank-based [12] | Genes are ranked by expression level within each cell; the sequence of gene names forms the model input. | Geneformer, GeneMamba, tGPT | Robust to batch effects; captures relative expression. | Discards absolute expression magnitude. |
| Value Categorization [15] | Gene expression values are binned into discrete "buckets," transforming the task into classification. | scBERT, scGPT | Preserves some absolute expression information. | May lose fine-grained resolution; sensitive to binning parameters. |
| Value Projection [12] [15] | Continuous expression values are projected into an embedding space via a linear transformation or MLP. | scFoundation, CellFM, scPRINT | Preserves full data resolution; no information loss from binning. | Diverges from traditional NLP tokenization. |
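The three strategies can be made concrete with a toy expression vector. The gene names, values, bin count, and the two-weight linear map below are illustrative stand-ins, not parameters from any of the named models (which use learned embeddings and far larger vocabularies).

```python
# Illustrative implementations of the three tokenization strategies
# applied to a toy gene -> normalized-expression mapping.

expr = {"CD8A": 0.1, "ACTB": 5.2, "TP53": 1.4, "GAPDH": 3.3}

def rank_tokens(expr):
    """Rank-based: order gene names by descending expression (Geneformer-style)."""
    return [g for g, v in sorted(expr.items(), key=lambda kv: -kv[1])]

def bin_tokens(expr, n_bins=3, max_val=6.0):
    """Value categorization: map each expression value to a discrete bin index."""
    width = max_val / n_bins
    return {g: min(int(v / width), n_bins - 1) for g, v in expr.items()}

def project_value(v, weights=(0.5, -0.2)):
    """Value projection: a 1-D -> 2-D linear map standing in for a learned MLP."""
    return [w * v for w in weights]

print(rank_tokens(expr))   # ['ACTB', 'GAPDH', 'TP53', 'CD8A']
print(bin_tokens(expr))    # CD8A and TP53 in bin 0, GAPDH in bin 1, ACTB in bin 2
print(project_value(5.2))  # continuous embedding, no information discarded
```

Note how the rank output discards the magnitudes entirely, the binned output keeps coarse magnitude, and the projection keeps the full value, mirroring the trade-offs in Table 2.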

Model Architectures

  • Transformer-based Models: Early models like scGPT and Geneformer leveraged the Transformer architecture for its powerful self-attention mechanism, which can model complex dependencies between genes [4] [15]. A significant limitation is the quadratic computational complexity of self-attention, which constrains scalability for long gene sequences [12].
  • State Space Models (SSMs): Newer models like GeneMamba adopt SSMs to address Transformer limitations. The BiMamba (Bidirectional Mamba) module efficiently captures gene context information with linear computational complexity, enabling scalable processing of over 50 million cells at a lower cost [12].
  • Hybrid and Variant Architectures: CellFM uses a modified RetNet framework (ERetNet Layers) to balance efficiency and performance, integrating a Gated Multi-head Attention unit and a LoRA (low-rank adaptation) module for efficient fine-tuning [15]. scPRINT uses a bidirectional transformer and incorporates protein embeddings from models like ESM2 as gene representations, leveraging evolutionary and structural priors [14].

Pretraining Objectives

Models are trained using self-supervised objectives that do not require manually labeled data. Common tasks include:

  • Masked Language Modeling (MLM): A random subset of genes in a cell's profile is masked, and the model is trained to recover their original expression values or ranks [4] [15].
  • Denoising and Upsampling: scPRINT employs a denoising task where the model learns to upsample transcript counts, helping to discriminate true zeros from technical dropouts [14].
  • Multi-Task Learning: scPRINT combines a denoising task, a bottleneck learning task (reconstructing expression from a compressed embedding), and a label prediction task (predicting cell type, disease, etc.) to create disentangled embeddings that represent different facets of the cell state [14].
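The masked-modeling setup behind the first objective can be sketched in a few lines: a fraction of the gene tokens is hidden and the originals are kept as reconstruction targets. The 15% mask rate and the "<mask>" token are illustrative choices, not values from any specific model.

```python
import random

# Minimal sketch of masked gene modeling: hide a fraction of the genes in a
# cell's tokenized profile and keep the originals as reconstruction targets.
# The 15% mask rate and "<mask>" token are illustrative assumptions.

def mask_profile(tokens, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok          # the model must recover this token
            masked.append("<mask>")
        else:
            masked.append(tok)
    return masked, targets

cell = ["ACTB", "GAPDH", "TP53", "CD8A", "MKI67", "EPCAM", "PTPRC", "VIM"]
masked, targets = mask_profile(cell)
# Every masked position has its target recorded; all others are unchanged.
assert all(masked[i] == "<mask>" for i in targets)
assert all(masked[i] == cell[i] for i in range(len(cell)) if i not in targets)
```

During pretraining, the model's loss is computed only at the positions stored in `targets`, which is what forces it to infer a gene's expression from its cellular context.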

The following diagram illustrates a generalized pretraining workflow that incorporates these common elements.

[Workflow diagram: scRNA-seq data → data preprocessing (QC, normalization, HVG) → expression tokenization (rank, bin, or project) → model input (gene tokens + embeddings) → pretraining (Transformer, SSM, etc.), guided by the pretraining objectives (MLM, denoising, multi-task) → pretrained foundation model.]

Evaluating Zero-Shot Performance

Rigorous evaluation in a zero-shot setting is crucial to determine if pretraining has endowed the model with a general, transferable understanding of biology, especially for discovery-driven research where labels are unavailable [4].

Key Evaluation Tasks and Metrics

  • Cell Type Clustering: The model generates cell embeddings without fine-tuning, and clustering algorithms are applied. Performance is measured using metrics like Average BIO (AvgBIO) score and Average Silhouette Width (ASW), which assess the separation and cohesion of known cell types [4].
  • Batch Integration: The model's ability to correct for technical batch effects while preserving biological variation is tested. Metrics evaluate both batch mixing (e.g., batch integration scores) and biological conservation (e.g., principal component regression score) [4].
  • Gene Network Inference: For models like scPRINT, zero-shot ability to infer biologically plausible gene-gene interactions is a key benchmark, often validated against literature-curated networks or orthogonal data [14].

Experimental Protocol: Zero-Shot Cell Embedding and Clustering

Purpose: To evaluate the quality of cell representations learned during pretraining by assessing their ability to separate known cell types without any further model training [4].

Procedure:

  • Input Data: Obtain a hold-out test scRNA-seq dataset not seen during pretraining. Preprocess it according to the model's requirements (e.g., normalize, select the same highly variable genes).
  • Generate Embeddings: Pass the preprocessed expression matrix through the pretrained foundation model to extract a low-dimensional embedding vector for each cell.
  • Dimensionality Reduction (Optional): Apply UMAP or t-SNE to the embeddings for visualization in 2D.
  • Clustering: Apply a clustering algorithm (e.g., Louvain, K-means) directly to the cell embeddings.
  • Evaluation:
    • Calculate the AvgBIO score and ASW using the known ground-truth cell type labels.
    • Compare the results against baseline methods like Highly Variable Genes (HVG), scVI, and Harmony [4].

Interpretation: Strong performance indicates that the pretrained model's embeddings capture biologically meaningful structure relevant to cell identity. Underperformance may suggest limitations in the pretraining task or data [4].
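The ASW metric used in step 5 can be made concrete with a toy pure-Python implementation on 2-D embeddings; a real evaluation would use `sklearn.metrics.silhouette_score` on the full embedding matrix. The points and labels below are made up for illustration.

```python
# Toy silhouette-width computation on 2-D embeddings, to make the ASW metric
# concrete; real evaluations use sklearn.metrics.silhouette_score.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def silhouette(points, labels):
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        other_means = []
        for lab in set(labels) - {labels[i]}:
            ds = [dist(p, q) for j, q in enumerate(points) if labels[j] == lab]
            other_means.append(sum(ds) / len(ds))
        a = sum(same) / len(same)      # mean intra-cluster distance
        b = min(other_means)           # mean distance to nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)   # average silhouette width (ASW)

emb = [(0, 0), (0, 1), (5, 5), (5, 6)]   # two well-separated toy cell types
labels = ["T cell", "T cell", "B cell", "B cell"]
asw = silhouette(emb, labels)
assert asw > 0.8   # tight, well-separated clusters score near 1
```

An embedding that mixes cell types would drive `a` up relative to `b`, pushing the score toward 0 or below, which is exactly the failure mode the zero-shot benchmarks measure.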

Critical Findings from Zero-Shot Evaluations

Recent studies reveal that the zero-shot performance of foundation models can be inconsistent:

  • Models like Geneformer and scGPT can underperform simpler baseline methods (e.g., HVG selection, scVI, Harmony) on tasks like cell type clustering and batch integration [4] [5].
  • Pretraining generally confers an advantage over randomly initialized models, but performance does not always monotonically improve with larger and more diverse datasets [4].
  • Surprisingly, models do not always perform best on datasets that were included in their pretraining corpus, indicating an unclear relationship between the pretraining objective and specific downstream tasks [4].

The following workflow outlines the process for conducting a zero-shot evaluation, highlighting the comparison to established baselines.

[Workflow diagram: a test dataset with ground-truth labels is fed both to the pretrained foundation model (producing cell embeddings) and to the baseline methods (HVG, scVI, Harmony); both sets of results pass through clustering and evaluation (AvgBIO, ASW) to yield a performance report.]

The Scientist's Toolkit: Research Reagent Solutions

This table details key computational tools and resources essential for working with single-cell foundation models.

Table 3: Essential Research Reagents and Tools for Single-Cell Foundation Model Research

| Item Name | Function / Application | Specifications / Notes |
|---|---|---|
| cellxgene Database [4] [14] | A curated source of massive-scale, annotated single-cell data for model pretraining. | Provides standardized data from diverse tissues and species; critical for assembling large corpora. |
| scGPT [4] [15] | A transformer-based foundation model for single-cell analysis. | Uses value categorization tokenization; offers capabilities for cell type annotation and batch correction. |
| GeneMamba [12] | A state space model (SSM) for efficient large-scale single-cell data processing. | Uses BiMamba module for linear-complexity processing; employs rank-based discretization. |
| scPRINT [14] | A transformer model designed for gene network inference with multi-task pretraining. | Incorporates protein embeddings (ESM2) as gene priors; features denoising and label prediction tasks. |
| CellFM [15] | A large-scale foundation model trained on 100 million human cells. | Uses value projection and ERetNet architecture; focuses on gene function and perturbation prediction. |
| Harmony & scVI [4] | Specialized, non-foundation model tools for batch integration and dimensionality reduction. | Commonly used as strong baselines for evaluating the zero-shot batch integration performance of foundation models. |
| Scanpy [11] | A scalable Python toolkit for analyzing single-cell gene expression data. | Provides standard pipelines for data preprocessing, visualization, clustering, and trajectory inference. |

Why Zero-Shot Evaluation is Crucial for Unbiased Biological Discovery

Single-cell foundation models (scFMs) represent a revolutionary advance in computational biology, trained on millions of single-cell gene expression profiles to learn fundamental biological principles. These models are typically built on transformer architectures and pretrained using self-supervised objectives, such as masked gene expression prediction, where the model learns to predict withheld genes based on contextual information from other genes [1]. The promise of scFMs lies in their potential to capture universal patterns of cellular function and organization that can generalize to diverse downstream applications without task-specific training.

Zero-shot evaluation refers to assessing model performance on new, unseen data without any further training or fine-tuning of the model parameters. This evaluation paradigm is particularly critical for biological discovery research, where researchers frequently encounter unexplored cellular states, novel disease contexts, or uncharacterized experimental conditions [4] [3]. In these scenarios, labeled data for fine-tuning is nonexistent, and models must rely entirely on knowledge acquired during pretraining. The ability to perform effectively in zero-shot settings indicates that a model has learned transferable biological concepts rather than merely memorizing patterns from its training data.

Quantitative Evidence: Current ScFMs Struggle with Zero-Shot Tasks

Recent rigorous evaluations of popular scFMs like Geneformer and scGPT have revealed significant limitations in their zero-shot capabilities across multiple biological tasks. The performance gaps between these complex foundation models and simpler baseline methods are substantial and consistent across diverse datasets.

Performance Deficits in Cell Type Clustering

Cell type clustering represents a fundamental task in single-cell analysis where models must group cells with similar biological functions while ignoring technical variations. When evaluated on this task in zero-shot settings, foundation models consistently underperform established methods:

Table 1: Zero-shot Performance in Cell Type Clustering (AvgBIO Score)

| Method | Pancreas | PBMC (12k) | Tabula Sapiens | Immune |
|---|---|---|---|---|
| scGPT | 0.41 | 0.52 | 0.38 | 0.45 |
| Geneformer | 0.32 | 0.36 | 0.29 | 0.34 |
| scVI | 0.58 | 0.49 | 0.55 | 0.62 |
| Harmony | 0.54 | 0.47 | 0.51 | 0.58 |
| HVG | 0.61 | 0.55 | 0.59 | 0.64 |

As illustrated in Table 1, both scGPT and Geneformer are outperformed by simpler methods across most datasets, with the simple Highly Variable Genes (HVG) selection approach consistently achieving superior performance [4]. This performance gap is particularly striking given that HVG represents a basic feature selection strategy rather than a sophisticated machine learning model.

Challenges in Batch Integration

Batch integration aims to remove technical artifacts from different experiments while preserving biological signal. This task is especially challenging for zero-shot evaluation because models must generalize across diverse experimental conditions:

Table 2: Batch Integration Performance (Batch Mixing Score)

| Method | Pancreas | PBMC | Tabula Sapiens | Immune |
|---|---|---|---|---|
| scGPT | 0.48 | 0.52 | 0.61 | 0.59 |
| Geneformer | 0.31 | 0.35 | 0.28 | 0.33 |
| scVI | 0.65 | 0.61 | 0.58 | 0.52 |
| Harmony | 0.62 | 0.58 | 0.45 | 0.63 |
| HVG | 0.71 | 0.66 | 0.68 | 0.69 |

Geneformer consistently ranks at the bottom across all batch integration metrics, while scGPT shows variable performance—excelling on datasets it encountered during pretraining but struggling with novel datasets [4]. Qualitative assessment reveals that Geneformer's embedding space often fails to retain meaningful cell type information, with clustering primarily driven by batch effects rather than biological signals [4].

[Diagram: in the pretraining phase, large-scale single-cell data feeds self-supervised learning (masked gene prediction) to produce a foundation model. In deployment, a biological discovery context (no labels available) requires zero-shot use, which reveals true generalization, model limitations, and biases; an applied research context (labels available) permits fine-tuning, which may mask pretraining failures and yield overestimated capabilities.]

Diagram 1: The critical role of zero-shot evaluation in revealing true model capabilities beyond fine-tuning scenarios. Zero-shot testing exposes limitations that may be masked during fine-tuning evaluations.

Experimental Protocols for Zero-Shot Evaluation

Implementing rigorous zero-shot evaluation requires standardized protocols that assess model performance across biologically meaningful tasks without any parameter updates or task-specific adaptations.

Protocol 1: Cell Type Clustering Evaluation

Purpose: To evaluate a model's ability to generate embeddings that separate known cell types without explicit training on cell type labels.

Materials:

  • Test Dataset: A fully annotated single-cell RNA-seq dataset with validated cell type labels
  • Baseline Methods: Standard approaches including HVG selection, scVI, and Harmony
  • Evaluation Metrics: AvgBIO score, Average Silhouette Width (ASW)

Procedure:

  • Data Preprocessing: Normalize the test dataset using standard scRNA-seq pipelines without applying any batch correction
  • Embedding Generation:
    • Process the dataset through the foundation model in zero-shot mode to extract cell embeddings
    • Generate comparison embeddings using baseline methods (HVG, scVI, Harmony)
  • Clustering:
    • Apply standardized clustering algorithms (e.g., Leiden, K-means) to all embedding types
    • Use consistent parameters and random seeds across all methods
  • Evaluation:
    • Calculate clustering metrics against ground truth cell type labels
    • Compare performance across methods using statistical testing

Interpretation: Superior performance in this protocol indicates that a model's embeddings capture biologically relevant information about cell identity and function [4] [16].

Protocol 2: Batch Integration Assessment

Purpose: To assess a model's capability to remove technical batch effects while preserving biological variation.

Materials:

  • Batch-Controlled Dataset: A dataset containing the same cell types profiled across multiple batches or technologies
  • Evaluation Framework: Metrics including batch mixing scores and biological conservation metrics

Procedure:

  • Dataset Selection: Identify or create a benchmark dataset with known batch effects and biological signals
  • Embedding Generation: Extract zero-shot embeddings from the foundation model and baseline methods
  • Dimensionality Reduction: Apply UMAP or t-SNE for visualization (qualitative) and retain full embeddings for quantitative analysis
  • Quantitative Assessment:
    • Calculate batch mixing metrics (e.g., LISI scores) to assess technical effect removal
    • Compute biological conservation metrics (e.g., cell type ASW) to ensure biological signal preservation
  • Visual Inspection: Examine 2D projections to identify whether batch effects dominate the embedding space

Interpretation: Effective batch integration demonstrates that a model can generalize across technical variations, a crucial capability for real-world biological discovery [4].
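The batch mixing quantification in step 4 can be sketched with a toy neighborhood score in the spirit of LISI: the inverse Simpson's index of batch labels among each cell's k nearest neighbors. The embeddings and k below are illustrative; real evaluations would use the scib or lisi packages on full-dimensional embeddings.

```python
# Toy batch-mixing score in the spirit of LISI: for each cell, compute the
# inverse Simpson's index of batch labels among its k nearest neighbors.
# A score of 1 means neighborhoods contain a single batch (poor mixing);
# a score near the number of batches means well-mixed embeddings.

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def batch_mixing(emb, batches, k=3):
    scores = []
    for i, p in enumerate(emb):
        nn = sorted((j for j in range(len(emb)) if j != i),
                    key=lambda j: euclid(p, emb[j]))[:k]
        counts = {}
        for j in nn:
            counts[batches[j]] = counts.get(batches[j], 0) + 1
        simpson = sum((c / k) ** 2 for c in counts.values())
        scores.append(1.0 / simpson)
    return sum(scores) / len(scores)

# Well-mixed toy data: two batches interleaved along one embedding axis.
emb = [(0, 0), (0.1, 0), (0.2, 0), (0.3, 0)]
batches = ["A", "B", "A", "B"]
mixed = batch_mixing(emb, batches)
assert mixed > 1.5   # neighborhoods contain both batches
```

Pairing this score with a biological conservation metric (such as cell type ASW on the same embeddings) catches the degenerate case where a method "mixes" batches by collapsing all biological structure.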

Implementing robust zero-shot evaluation requires specific computational tools and resources. The following table outlines key components of the evaluation toolkit:

Table 3: Essential Research Reagents and Computational Tools

| Resource | Type | Function in Evaluation | Examples/Alternatives |
|---|---|---|---|
| Benchmark Datasets | Data | Provide standardized testing grounds for model comparison | Tabula Sapiens, Pancreas datasets, PBMC datasets [4] |
| Evaluation Metrics | Algorithm | Quantify model performance across multiple dimensions | AvgBIO, ASW, batch mixing scores, PCR [4] |
| Baseline Methods | Software | Establish performance baselines for meaningful comparison | HVG selection, scVI, Harmony [4] |
| Unified Frameworks | Platform | Standardize model access and evaluation procedures | BioLLM framework [17] |
| Visualization Tools | Software | Enable qualitative assessment of embedding quality | UMAP, t-SNE plotting utilities |

The BioLLM framework deserves particular attention as it provides standardized APIs for accessing diverse scFMs, eliminating architectural and coding inconsistencies that complicate rigorous comparison [17]. This framework supports both zero-shot and fine-tuning evaluation, enabling comprehensive assessment of model capabilities.

[Diagram: a single-cell expression matrix is processed by both the zero-shot foundation model and the baseline methods (HVG, scVI, Harmony); cell metadata (batch, donor) and ground-truth labels (cell type, condition) then feed three families of metrics: biological conservation (ASW, ARI, NMI), technical integration (batch scores, LISI), and knowledge-driven measures (scGraph-OntoRWR, LCAD).]

Diagram 2: Comprehensive zero-shot evaluation workflow integrating multiple data types, evaluation methods, and performance metrics to assess foundation model capabilities.

Emerging Solutions and Future Directions

While current scFMs show limitations in zero-shot settings, research is advancing toward more robust solutions. Several promising approaches are emerging:

Improved Pretraining Strategies

Recent evidence suggests that pretraining dataset composition significantly impacts zero-shot performance. Studies evaluating scGPT variants pretrained on different tissue-specific datasets (kidney, blood, and general human cells) found that performance improvements plateau despite increased dataset diversity [4]. This indicates that simply scaling up data may be insufficient, and more sophisticated pretraining objectives are needed.

Efficient Adaptation Methods

Novel fine-tuning approaches that preserve pretrained knowledge show promise for enhancing zero-shot generalization. Techniques like drug-conditional adapters that train less than 1% of original foundation model parameters enable better molecular conditioning while maintaining rich biological representations [18]. This approach has demonstrated improved zero-shot generalization to unseen cell lines while preserving core model capabilities.
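The "less than 1% of parameters" figure is easy to sanity-check with LoRA-style bookkeeping: a rank-r adapter on a d_in × d_out weight matrix adds r·(d_in + d_out) parameters on top of the frozen d_in·d_out base. The layer sizes and rank below are hypothetical, not taken from any cited model.

```python
# Back-of-the-envelope check of the "<1% of parameters" adapter claim.
# A rank-r low-rank adapter on a d_in x d_out weight adds r * (d_in + d_out)
# trainable parameters while the d_in * d_out base weight stays frozen.
# The layer sizes and rank are hypothetical illustrations.

def adapter_fraction(layers, rank):
    base = sum(d_in * d_out for d_in, d_out in layers)
    added = sum(rank * (d_in + d_out) for d_in, d_out in layers)
    return added / base

layers = [(512, 512)] * 12      # e.g., 12 square attention projections
frac = adapter_fraction(layers, rank=2)
assert frac < 0.01              # well under 1% of the base parameter count
print(f"{frac:.2%}")
```

Because the adapter fraction scales as r/d for square layers, low ranks on high-dimensional weights keep the trainable footprint tiny, which is what lets such methods condition the model without disturbing its pretrained representations.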

Biological Knowledge Integration

Incorporating biological prior knowledge through novel evaluation metrics represents another advancement. The scGraph-OntoRWR metric measures consistency between cell type relationships captured by scFMs and established biological knowledge from cell ontologies [16]. Similarly, the Lowest Common Ancestor Distance (LCAD) metric assesses the ontological proximity between misclassified cell types, providing more biologically meaningful error assessment [16].
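The LCAD idea can be sketched on a miniature cell-ontology tree: the severity of a misannotation is the number of edges from the true and predicted types up to their lowest common ancestor. The toy ontology below is hypothetical and far smaller than the Cell Ontology used by the actual metric.

```python
# Toy Lowest Common Ancestor Distance (LCAD): misannotating a cell as an
# ontologically nearby type counts as a milder error than a distant one.
# The miniature parent map below is a hypothetical stand-in for the
# Cell Ontology graph.

PARENT = {
    "CD4 T cell": "T cell", "CD8 T cell": "T cell",
    "T cell": "lymphocyte", "B cell": "lymphocyte",
    "lymphocyte": "immune cell", "monocyte": "immune cell",
}

def ancestors(node):
    """Path from a node up to the ontology root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lcad(truth, pred):
    a, b = ancestors(truth), ancestors(pred)
    common = next(x for x in a if x in b)        # lowest common ancestor
    return a.index(common) + b.index(common)     # edges up from both sides

assert lcad("CD4 T cell", "CD8 T cell") == 2   # sibling subtypes: mild error
assert lcad("CD4 T cell", "monocyte") == 4     # distant lineages: severe error
```

Averaging this distance over all misclassified cells yields an error score that, unlike plain accuracy, distinguishes a CD4/CD8 confusion from mistaking a T cell for a monocyte.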

Zero-shot evaluation provides an essential reality check for single-cell foundation models, revealing limitations that fine-tuning-based assessments often mask. Current evidence demonstrates that even popular scFMs like Geneformer and scGPT struggle to outperform simpler methods on fundamental tasks like cell type clustering and batch integration when deployed without additional training. These findings underscore the importance of rigorous zero-shot testing as a standard practice in model development and validation.

As the field progresses, improved pretraining strategies, efficient adaptation methods, and biologically-informed evaluation metrics will likely enhance the zero-shot capabilities of future foundation models. By maintaining focus on rigorous evaluation and acknowledging current limitations, the research community can develop more robust and biologically meaningful models that truly advance discovery in single-cell biology.

Single-cell Foundation Models (scFMs) are large-scale deep learning models pretrained on vast single-cell omics datasets, designed to capture universal biological patterns that can be adapted to various downstream tasks. This overview examines the architecture, performance, and application of leading scFMs, with particular focus on their capabilities in zero-shot learning environments where models are applied to new data without further training. The evaluation reveals a critical insight: while these models show significant promise, their zero-shot performance often lags behind simpler, established methods, highlighting a substantial gap between pretraining objectives and practical biological discovery applications.

Single-cell foundation models represent a transformative approach in computational biology, leveraging self-supervised learning on massive single-cell datasets to develop a fundamental understanding of cellular biology. These models are built on the premise that by exposing an algorithm to millions of cells across diverse tissues, conditions, and species, it can learn the intrinsic "language" of cells and genes, capturing complex relationships that enable generalization to novel biological questions [1]. The emergence of scFMs parallels developments in natural language processing, where foundation models have revolutionized how machines understand and generate human language. In the biological context, individual cells are treated analogously to sentences, while genes or genomic features serve as words or tokens that collectively define cellular identity and function [1].

The significance of scFMs is particularly pronounced in zero-shot learning scenarios, which are essential for true biological discovery. In zero-shot settings, models must make predictions on new, unseen data without any further training, mimicking the exploratory nature of biological research where predefined labels are often unavailable [4]. This capability is critical for applications such as novel cell type identification, where researchers encounter unannotated data from experiments investigating previously uncharacterized biological conditions. Despite the theoretical promise, rigorous evaluation of scFMs in zero-shot contexts has revealed significant limitations, suggesting that current models may not yet fulfill their potential for transformative biological discovery without additional specialized training [4] [3].

Architectural Landscape of Key scFMs

Model Architectures and Pretraining Strategies

scFMs predominantly utilize transformer-based architectures, which employ attention mechanisms to weight the importance of different genes when making predictions about cellular states. The two primary architectural paradigms are encoder-based models (e.g., scBERT, Geneformer) and decoder-based models (e.g., scGPT), with some implementations using hybrid designs [1]. These models vary significantly in their parameter counts, pretraining datasets, and specific architectural implementations, leading to diverse performance characteristics across different biological tasks.

Table 1: Architectural Overview of Leading Single-Cell Foundation Models

| Model Name | Architecture Type | Parameters | Pretraining Dataset Size | Key Innovations |
|---|---|---|---|---|
| Geneformer | Transformer Encoder | 40 million | 30 million cells | Rank-based gene tokenization; attention regularization |
| scGPT | GPT-style Decoder | 50 million | 33 million cells | Multi-omic support; generative pretraining |
| scBERT | BERT-style Encoder | Not specified | Millions of cells | Focus on cell type annotation |
| UCE | Transformer Encoder | 650 million | 36 million cells | Protein language model embeddings for genes |
| scFoundation | Encoder-Decoder | 100 million | 50 million cells | Read-depth-aware masked gene modeling |
| GeneMamba | State Space Model | Not specified | >50 million cells | BiMamba module for long-sequence efficiency |

Input Representation and Tokenization Strategies

A fundamental challenge in adapting transformer architectures to single-cell data is the non-sequential nature of gene expression, unlike the inherent sequence in natural language. To address this, scFMs employ various tokenization strategies to convert gene expression profiles into structured model inputs:

  • Rank-based discretization: Used by Geneformer and LangCell, this approach orders genes by their expression levels within each cell, creating a deterministic sequence based on expression magnitude [1] [12].
  • Bin-based discretization: Employed by scBERT and scGPT, this method groups expression values into predefined bins, balancing resolution with computational efficiency [1] [12].
  • Value projection: Implemented in scFoundation, this technique projects continuous expression values into embedding spaces without discrete categorization [12].

These tokenization approaches are combined with specialized embeddings for gene identifiers, expression values, and positional information to create comprehensive input representations that preserve biological meaning while conforming to architectural requirements of transformer models [16].
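A minimal sketch of how the three embedding components combine into one input vector per gene token is given below. The 4-dimensional embeddings and lookup tables are toy placeholders; in real models these are learned parameters with hundreds of dimensions.

```python
# Minimal sketch of composing gene-identity, expression-value, and positional
# embeddings into one input vector per token. The 4-dim embeddings and the
# simple lookup/scaling functions are toy stand-ins for learned parameters.

DIM = 4
GENE_EMB = {"ACTB": [0.1] * DIM, "TP53": [0.2] * DIM}

def value_emb(v):
    """Stand-in for a learned expression-value embedding."""
    return [v * 0.01] * DIM

def pos_emb(i):
    """Stand-in for a learned positional embedding."""
    return [i * 0.001] * DIM

def token_input(gene, value, position):
    """Elementwise sum of the three embedding components."""
    g, v, p = GENE_EMB[gene], value_emb(value), pos_emb(position)
    return [gi + vi + pi for gi, vi, pi in zip(g, v, p)]

vec = token_input("ACTB", value=5.0, position=0)
assert len(vec) == DIM   # one fixed-width vector per gene token
```

The summed vector is what the transformer (or SSM) layers actually consume, so each token simultaneously encodes which gene it is, how strongly it is expressed, and where it sits in the tokenized sequence.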

[Diagram: a gene expression matrix can be tokenized by rank-based ordering, bin-based discretization, or continuous value projection; the resulting tokens are then combined with gene-identity, expression-value, and positional embeddings to form the model input sequence.]

Quantitative Performance Benchmarking

Zero-Shot Performance Evaluation

Rigorous evaluation of scFMs in zero-shot settings is essential for assessing their true potential in biological discovery. Recent benchmarking studies have revealed significant limitations in current models when deployed without task-specific fine-tuning. In critical tasks such as cell type clustering and batch integration, popular scFMs including Geneformer and scGPT have been consistently outperformed by simpler traditional methods [4] [3].

Table 2: Zero-Shot Performance Comparison Across Biological Tasks

| Model | Cell Type Clustering (AvgBIO Score) | Batch Integration (iLISI Score) | Perturbation Analysis | Biological Insight Capture |
|---|---|---|---|---|
| scGPT | Variable performance; outperforms baselines on PBMC dataset only | Moderate; better on complex biological batches | Limited data | Shows promise in gene network inference |
| Geneformer | Consistently outperformed by simpler methods | Poor; often increases batch effects | Limited data | Demonstrates some gene relationship capture |
| scVI | Strong performance across multiple datasets | Excellent on technical batches | Strong performance | Established reliable baseline |
| Harmony | Competitive cell type separation | Excellent batch mixing | Not specialized for perturbations | Not designed for deep biological insights |
| HVG Selection | Surprisingly effective; often outperforms scFMs | Best overall batch integration scores | Simple but effective | Limited to variance-based features |

In cell type clustering tasks, both Geneformer and scGPT underperformed compared to established methods like Harmony and scVI, as measured by the Average BIO (AvgBIO) score across multiple datasets [4]. Notably, the simple approach of selecting Highly Variable Genes (HVG) frequently outperformed both foundation models, raising questions about the effectiveness of their pretraining paradigms [4]. For batch integration, a crucial task for combining datasets from different experimental sources, Geneformer particularly struggled, with its embeddings often showing stronger batch effects than the original input data [4].

Performance Across Experimentally Relevant Tasks

Beyond standard benchmarks, scFMs have been evaluated on biologically and clinically relevant tasks including cancer cell identification, drug sensitivity prediction, and cross-tissue analysis. These evaluations reveal a nuanced landscape where no single model consistently outperforms others across all tasks [16]. The performance varies significantly based on factors such as dataset size, tissue type, and specific biological questions, emphasizing the importance of task-specific model selection.

Specialized evaluation metrics like scGraph-OntoRWR (which measures consistency between model-derived cell relationships and established biological knowledge) and Lowest Common Ancestor Distance (which quantifies the severity of cell type misannotation errors) provide deeper insights into the biological relevance of scFM embeddings [16]. These knowledge-based evaluation approaches demonstrate that pretrained scFM embeddings do capture meaningful biological information about gene and cell relationships, even when their performance on specific tasks may lag behind simpler methods [16].

Experimental Protocols for scFM Evaluation

Standardized Zero-Shot Evaluation Workflow

To ensure reproducible assessment of scFM performance, researchers should follow a standardized protocol for zero-shot evaluation. The following workflow outlines key steps for benchmarking models on novel datasets:

Protocol 1: Zero-Shot Cell Type Clustering

  • Data Preparation: Obtain a holdout dataset not included in the model's pretraining corpus. Standard quality control should be applied, including filtering low-quality cells and genes, without batch correction.
  • Embedding Generation: Process the dataset through the target scFM without any fine-tuning to extract cell embeddings.
  • Dimensionality Reduction: Apply standard techniques (UMAP, t-SNE) to the embeddings for visualization.
  • Clustering: Perform Leiden or Louvain clustering on the embeddings without using biological labels.
  • Evaluation: Calculate clustering metrics including Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and silhouette scores against known cell type labels.
  • Comparison: Benchmark against established baselines including HVG selection, scVI, and Harmony embeddings using identical evaluation metrics.
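The evaluation step of this protocol can be sketched in a few lines. The example below is illustrative rather than any benchmark's exact code: k-means stands in for Leiden clustering (which requires a kNN graph and the leidenalg package), and the embeddings are a synthetic stand-in for zero-shot scFM output.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

rng = np.random.default_rng(0)
# Toy stand-in for zero-shot scFM cell embeddings: 3 cell types, 50-D
emb = np.vstack([rng.normal(loc=i * 3.0, size=(100, 50)) for i in range(3)])
true_labels = np.repeat([0, 1, 2], 100)

# Leiden/Louvain operate on a kNN graph; k-means stands in for this sketch
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)

nmi = normalized_mutual_info_score(true_labels, pred)
ari = adjusted_rand_score(true_labels, pred)
sil = silhouette_score(emb, true_labels)
print(f"NMI={nmi:.3f}  ARI={ari:.3f}  silhouette={sil:.3f}")
```

The same three metric calls can be applied unchanged to embeddings from any of the baseline methods, which keeps the comparison step consistent across models.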

Protocol 2: Batch Integration Assessment

  • Dataset Selection: Choose datasets with known batch effects from different experimental technologies or laboratories.
  • Embedding Extraction: Generate zero-shot embeddings using the target scFM.
  • Batch Mixing Quantification: Calculate batch integration metrics including iLISI (Integration Local Inverse Simpson's Index) and silhouette batch scores.
  • Biological Conservation Evaluation: Assess whether batch correction preserves biological variation using metrics such as bio-conservation scores and label classification accuracy.
  • Comparative Analysis: Compare against standard batch correction methods to determine relative performance.
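For readers implementing the batch-mixing step directly, the following is a simplified sketch of iLISI using unweighted kNN counts (the published metric weights neighbors by distance, and tools like scib-metrics implement it in full). The score ranges from 1 (no mixing) to the number of batches (perfect mixing).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ilisi(embeddings, batches, k=30):
    """Simplified iLISI: mean inverse Simpson's index of batch labels
    among each cell's k nearest neighbors. 1 = no mixing; n_batches =
    perfect mixing. Unweighted sketch, not the distance-weighted metric."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    idx = idx[:, 1:]  # drop each cell's self-neighbor
    batches = np.asarray(batches)
    scores = []
    for neigh in idx:
        _, counts = np.unique(batches[neigh], return_counts=True)
        p = counts / counts.sum()
        scores.append(1.0 / np.sum(p ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(200, 10))         # two batches, fully overlapping
batch = np.repeat([0, 1], 100)
separated = mixed + batch[:, None] * 10.0  # same cells with a strong batch shift
print(ilisi(mixed, batch), ilisi(separated, batch))  # well-mixed ≈ 2, separated = 1
```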

[Workflow diagram: Data Preparation (holdout dataset, QC filtering) → Embedding Generation (zero-shot model inference) → Dimensionality Reduction (UMAP, t-SNE) → Clustering (Leiden, Louvain) → Evaluation (NMI, ARI, silhouette) → Comparative Analysis (vs. HVG, scVI, Harmony)]

Interpretation and Biological Validation

Beyond quantitative metrics, biological validation is crucial for establishing the practical utility of scFMs. Researchers should incorporate:

  • Differential Expression Analysis: Verify that cluster markers derived from scFM embeddings correspond to biologically meaningful gene signatures.
  • Cell Type Annotation Accuracy: Assess whether model embeddings enable correct identification of known cell types, particularly for rare populations.
  • Functional Enrichment: Perform gene ontology enrichment on genes most influential in the model's attention patterns to identify biologically relevant pathways.
  • Stability Analysis: Evaluate consistency of results across different random seeds and dataset subsamples to ensure robustness.
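The stability check in the last point can be quantified as the mean pairwise ARI between clusterings run with different random seeds. This sketch again uses k-means for brevity; the same pattern applies to Leiden runs with different seeds.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clustering_stability(embeddings, n_clusters, seeds=(0, 1, 2, 3)):
    """Mean pairwise ARI between clusterings obtained with different
    random seeds; 1.0 indicates perfectly reproducible clusters."""
    labelings = [KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=s).fit_predict(embeddings)
                 for s in seeds]
    aris = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
    return float(np.mean(aris))

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(loc=i * 4.0, size=(80, 20)) for i in range(4)])
print(clustering_stability(emb, n_clusters=4))
```

A low stability score is a warning sign that clusters reported from scFM embeddings may be artifacts of a particular run rather than robust biological structure.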

Essential Research Toolkit

Implementing and evaluating scFMs requires specialized computational resources and software tools. The following toolkit outlines essential components for researchers working with single-cell foundation models:

Table 3: Essential Research Toolkit for scFM Implementation

| Tool/Resource | Function | Application in scFM Research |
| --- | --- | --- |
| CELLxGENE Census | Unified data resource | Access to standardized single-cell data for training and evaluation |
| BioLLM Framework | Unified model interface | Standardized APIs for multiple scFMs; benchmarking support |
| scib-metrics | Standardized benchmarking metrics | Computation of bio-conservation and batch correction metrics |
| Scanpy | Single-cell analysis | Preprocessing, visualization, and integration with model embeddings |
| Hugging Face Transformers | Model architecture library | Adaptation of transformer architectures for biological data |
| scGPT Implementation | Pretrained models and training code | Access to scGPT model weights and fine-tuning pipelines |
| Geneformer Model | Pretrained rank-based model | Geneformer embeddings and transfer learning capabilities |

The CELLxGENE platform provides access to over 100 million curated single cells, serving as a vital resource for both pretraining and evaluation [1] [19]. For standardized model comparison, the BioLLM framework offers unified APIs that eliminate architectural and coding inconsistencies, enabling direct performance comparisons across different scFMs [17]. Established single-cell analysis toolkits like Scanpy complement these specialized resources by providing robust preprocessing and visualization capabilities that integrate with scFM-derived embeddings.

The development of single-cell foundation models represents a promising frontier in computational biology, but significant challenges remain. Current evaluations indicate that these models have not yet consistently realized their potential for zero-shot biological discovery, with simpler methods often outperforming complex foundation models on critical tasks [4] [20]. This performance gap highlights fundamental questions about current pretraining approaches and whether masked language modeling objectives effectively capture the biological knowledge needed for generalized reasoning.

Future progress in scFMs will likely require innovations in several key areas. Architecturally, emerging approaches like GeneMamba's state space models offer promising alternatives to transformer-based architectures, potentially addressing computational efficiency limitations while maintaining performance [12]. Pretraining strategies may need fundamental rethinking to better align objectives with biological reasoning, potentially incorporating more explicit biological knowledge through gene networks, pathways, or ontological relationships. Evaluation standards must continue to evolve beyond technical metrics to assess true biological insight, possibly through carefully designed challenges that test models on novel biological predictions with experimental validation.

For researchers applying these tools, current evidence suggests a pragmatic approach: scFMs show considerable promise as components in biological discovery pipelines, but their limitations in zero-shot settings necessitate careful validation and comparison with established methods. As the field matures, the development of more robust evaluation frameworks and specialized architectures may eventually fulfill the promise of foundation models to transform our understanding of cellular biology.

Practical Applications and Methodological Advances in Zero-Shot scFM Deployment

Single-cell foundation models (scFMs) are machine learning models pretrained on massive-scale single-cell datasets, with the goal of capturing universal biological patterns. A critical assessment of these models involves zero-shot evaluation, where the model's internal representation of input data—an "embedding"—is used for downstream analysis with no further task-specific training. This is particularly vital in exploratory biological contexts where predefined labels are unavailable, making fine-tuning infeasible. The core promise of scFMs is their ability to generate robust cell embeddings that project noisy gene expression measurements into a more biologically relevant latent space, ready for immediate use in key atlas construction tasks without additional adaptation.

Recent rigorous evaluations, however, suggest that this promise remains only partially fulfilled. Kedzierska et al. (2025) report that in zero-shot settings, proposed foundation models like Geneformer and scGPT can, in some cases, be outperformed by simpler methods on standard tasks including cell type clustering and batch integration. These findings underscore the importance of robust zero-shot benchmarking as an essential step in the development and deployment of foundation models for single-cell biology, highlighting the current gap between model scale and reliable biological insight in discovery settings.

Core Zero-Shot Tasks in Atlas Construction

Task 1: Cell Type Clustering

Objective: To evaluate whether a foundation model's cell embeddings can effectively separate known cell types in an unseen dataset without any model fine-tuning. This tests the model's fundamental ability to encode biologically meaningful cell states.

Quantitative Performance Benchmark:

Performance is typically measured by the Average BIO (AvgBIO) score and Average Silhouette Width (ASW), which quantify the separation between known cell types in the embedding space. The following table summarizes the zero-shot performance of selected models against established baselines across multiple datasets, as reported by Kedzierska et al.:

Table 1: Zero-shot cell type clustering performance (AvgBIO score) across datasets

| Model / Method | Pancreas Dataset | PBMC (12k) Dataset | Tabula Sapiens Dataset | Immune Dataset |
| --- | --- | --- | --- | --- |
| HVG (Baseline) | 0.741 | 0.785 | 0.792 | 0.801 |
| Harmony | 0.752 | 0.791 | 0.805 | 0.812 |
| scVI | 0.768 | 0.779 | 0.798 | 0.809 |
| scGPT | 0.702 | 0.802 | 0.754 | 0.721 |
| Geneformer | 0.635 | 0.691 | 0.668 | 0.645 |

Source: Adapted from Kedzierska et al. [4]

Key Findings: The evaluation reveals that selecting Highly Variable Genes (HVG) often outperforms both scGPT and Geneformer across most metrics. While scGPT shows competitive performance on the PBMC dataset, its performance is inconsistent across other tissues. Geneformer consistently underperforms relative to all baselines. This suggests that the masked language model pretraining framework may not inherently produce cell embeddings that are optimal for cell type separation without task-specific fine-tuning.

Task 2: Batch Integration

Objective: To assess a model's capacity to eliminate non-biological technical variations (batch effects) across multiple data sources while preserving meaningful biological differences. Success in this task is crucial for building integrated atlases from multiple studies.

Quantitative Performance Benchmark:

Batch integration quality is evaluated using metrics that balance batch mixing (e.g., LISI score) and biological conservation (e.g., PCR score). The following table provides a comparative analysis:

Table 2: Batch integration performance across methods

| Model / Method | Batch Mixing Score (LISI, higher is better) | Biological Conservation (PCR, lower is better) | Overcorrection Sensitivity |
| --- | --- | --- | --- |
| HVG | 0.892 | 0.124 | Low |
| Harmony | 0.865 | 0.135 | Medium |
| scVI | 0.879 | 0.141 | Medium |
| scGPT | 0.831 | 0.152 | Not Reported |
| Geneformer | 0.745 | 0.218 | Not Reported |
| RBET Framework | 0.901* | 0.118* | High |

Note: *RBET values are illustrative based on its reported superior performance [21]. LISI: Local Inverse Simpson's Index; PCR: Principal Component Regression.

Key Findings: Geneformer's embeddings consistently show a higher proportion of variance explained by batch effects compared to the original data, indicating inadequate batch mixing. scGPT demonstrates variable performance, outperforming scVI and Harmony on complex datasets with combined technical and biological batch effects but underperforming on datasets with purely technical variation. The recently proposed RBET framework shows particular promise due to its sensitivity to overcorrection, a critical feature for preserving biological signal [21].

Experimental Protocols for Zero-Shot Evaluation

Protocol for Cell Type Clustering Evaluation

Required Inputs:

  • Preprocessed single-cell RNA-seq dataset (query dataset) with held-out cell type labels
  • Pretrained foundation model (e.g., scGPT, Geneformer) with frozen weights
  • Baseline methods (HVG, scVI, Harmony) for comparison

Procedure:

  • Embedding Generation: Pass the normalized count matrix of the query dataset through the foundation model to extract cell embeddings in a zero-shot manner (no gradient updates).
  • Dimensionality Reduction: Apply PCA to the embeddings, followed by UMAP for visualization (2D/3D).
  • Clustering: Perform Leiden clustering on the k-nearest neighbor graph constructed from the embeddings.
  • Evaluation: Compare cluster labels against ground truth cell type annotations using:
    • Average BIO score
    • Adjusted Rand Index (ARI)
    • Normalized Mutual Information (NMI)
  • Benchmarking: Repeat the embedding, dimensionality reduction, clustering, and evaluation steps for all baseline methods and compare scores.

Critical Controls:

  • Ensure no data leakage between pretraining and evaluation datasets
  • Use identical preprocessing pipelines for all methods
  • Apply multiple random seeds to assess stability

[Workflow diagram: Input Dataset + Pretrained Model → Generate Embeddings → Dimensionality Reduction → Cell Clustering → Performance Evaluation → Comparative Analysis]

Figure 1: Workflow for zero-shot cell type clustering evaluation
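The composite metric used in this protocol can be computed from its parts. The sketch below assumes the definition used in the scGPT-style benchmarks, where AvgBIO is the mean of NMI, ARI, and the cell-type silhouette width rescaled from [-1, 1] to [0, 1]; check the specific paper's definition before comparing numbers directly.

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

def avg_bio(embeddings, true_labels, pred_labels):
    """AvgBIO as commonly defined in scFM benchmarks: the mean of NMI,
    ARI, and cell-type ASW rescaled from [-1, 1] to [0, 1]."""
    nmi = normalized_mutual_info_score(true_labels, pred_labels)
    ari = adjusted_rand_score(true_labels, pred_labels)
    asw = (silhouette_score(embeddings, true_labels) + 1) / 2
    return (nmi + ari + asw) / 3

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(loc=i * 3.0, size=(60, 30)) for i in range(3)])
labels = np.repeat([0, 1, 2], 60)
score = avg_bio(emb, labels, labels)  # perfect clustering on separable data
print(f"AvgBIO: {score:.3f}")
```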

Protocol for Batch Integration Evaluation

Required Inputs:

  • Multi-batch single-cell dataset with known technical and biological covariates
  • Pretrained foundation model
  • Reference genes with stable expression patterns across cell types

Procedure:

  • Embedding Extraction: Generate cell embeddings for the multi-batch dataset using the foundation model in zero-shot mode.
  • Visual Assessment: Create UMAP plots colored by batch and cell type to qualitatively assess integration.
  • Quantitative Metrics:
    • Calculate batch mixing scores (LISI, kBET)
    • Compute biological conservation metrics (PCR, cell type ASW)
    • Apply RBET framework using reference genes to detect overcorrection [21]
  • Differential Expression Analysis: Perform differential expression testing between batches post-integration to identify residual technical effects.
  • Downstream Validation: Assess impact on downstream tasks like trajectory inference and cell-cell communication.
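The PCR metric in step 3 can be approximated by regressing the batch covariate onto each principal component and averaging the R² values weighted by explained variance. This is a simplified version of the scIB PCR metric, shown here for intuition rather than as a drop-in replacement.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pcr_batch(embeddings, batches, n_pcs=20):
    """Variance-weighted R^2 of batch labels regressed onto each PC.
    Higher values indicate a stronger residual batch effect. Simplified
    sketch of the scIB PCR metric."""
    n_pcs = min(n_pcs, min(embeddings.shape) - 1)
    pca = PCA(n_components=n_pcs).fit(embeddings)
    pcs = pca.transform(embeddings)
    onehot = np.eye(len(np.unique(batches)))[np.asarray(batches)]
    r2 = np.array([
        LinearRegression().fit(onehot, pcs[:, i]).score(onehot, pcs[:, i])
        for i in range(n_pcs)
    ])
    w = pca.explained_variance_ratio_
    return float(np.sum(w * r2) / np.sum(w))

rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 100)
mixed = rng.normal(size=(200, 10))           # no batch structure
shifted = mixed + batch[:, None] * 10.0      # strong batch effect
print(pcr_batch(mixed, batch), pcr_batch(shifted, batch))
```

Applied to scFM embeddings, a PCR score that exceeds the score of the raw input data is the signature of the Geneformer failure mode described above, where embeddings amplify rather than remove batch effects.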

Advanced Consideration - Disentanglement Models: For methods like scShift and CODAL that explicitly disentangle biological and technical variations [22] [23]:

  • The batch-dependent variation (biological embedding) captures disease states and perturbations
  • The batch-independent variation (unperturbed embedding) represents core cell type information
  • Evaluate the identifiability of both components on held-out datasets

[Workflow diagram: Multi-Batch Data + Foundation Model → Embedding Generation → Visual Assessment and Metric Calculation → Overcorrection Check (using Reference Genes) → Downstream Validation]

Figure 2: Workflow for zero-shot batch integration evaluation

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key computational tools and resources for zero-shot evaluation

| Tool/Resource | Type | Primary Function | Application in Zero-Shot Tasks |
| --- | --- | --- | --- |
| CELLxGENE Census | Data Resource | Curated single-cell data repository | Source of standardized evaluation datasets; enables cross-study comparisons |
| HVG Selection | Computational Method | Feature selection based on variance | Simple yet powerful baseline for cell type clustering and batch correction |
| RBET Framework | Evaluation Metric | Reference-informed batch effect testing | Detects overcorrection with sensitivity to biological variation preservation [21] |
| scIB Metrics | Evaluation Suite | Comprehensive integration benchmarking | Standardized metrics for batch mixing and bio-conservation (ASW, ARI, NMI) |
| scShift | Disentanglement Model | Separates batch and biological variations | Enables zero-shot biological state representation without annotations [22] |
| CODAL | Integration Model | Mutual information-based disentanglement | Addresses batch-confounded cell states through variational inference [23] |
| CellWhisperer | Multimodal Model | Joint embedding of transcriptomes and text | Facilitates zero-shot cell annotation through natural language queries [24] |

Emerging Capabilities and Future Directions

The field of zero-shot evaluation is rapidly evolving beyond basic clustering and integration. Novel approaches are demonstrating emergent capabilities that may shape future atlas construction protocols:

Biological State Disentanglement: Models like scShift show that scaling deep identifiable models enables zero-shot revelation of biological states. When trained on diverse compendiums of scRNA-seq atlases, these models can disentangle batch-dependent and independent variations, allowing direct comparison of biological states across datasets without additional training [22].

Multimodal Integration: Approaches like CellWhisperer establish multimodal embeddings connecting transcriptomes with textual annotations, enabling zero-shot prediction of cell types and biological functions through natural language queries [24]. This represents a paradigm shift from predefined classification schemas to flexible, knowledge-informed cell annotation.

Scaling Laws: Systematic evaluation of over 200 scShift models reveals emergent zero-shot capabilities beyond a transition threshold with respect to dataset diversity and size [22]. This suggests that, similar to large language models, single-cell foundation models may exhibit qualitatively improved capabilities when trained at sufficient scale.

These advances point toward a future where zero-shot evaluation will encompass not just technical performance metrics, but also the ability of models to capture meaningful biological relationships, generalize to novel cell states, and integrate multimodal information for holistic cell atlas construction.

Zero-shot learning represents a paradigm shift in machine learning, enabling models to recognize or classify data from categories they have never explicitly encountered during training [25]. Within the domain of single-cell biology, this capability is being advanced by single-cell foundation models (scFMs), which are large-scale neural networks pretrained on massive, diverse datasets of single-cell transcriptomics information [26] [2]. These models learn a foundational understanding of cellular biology by identifying universal patterns in gene expression. The emergent ability to perform tasks without additional task-specific training (zero-shot) is critical for drug discovery, as it allows researchers to predict how cells will respond to novel therapeutic compounds or under new experimental conditions where pre-existing labels are unavailable [4]. This protocol details the application of scFMs for the zero-shot prediction of cellular responses to novel drugs, a process poised to accelerate therapeutic development and personalized medicine.

Key Concepts and Foundation Models

Core Principles of Zero-Shot Prediction

In the context of single-cell data, zero-shot prediction operates by leveraging the semantic knowledge that scFMs acquire during pretraining. A model learns to map high-dimensional, sparse single-cell RNA sequencing (scRNA-seq) data into a meaningful latent space where cells with similar biological functions and states are positioned proximally [2]. When presented with a novel drug—a "class" not seen during training—the model does not rely on pre-learned drug-specific patterns. Instead, it leverages its generalized understanding of cellular biology to infer the potential relationship between the cell's baseline state and the expected phenotypic outcome, such as sensitivity or resistance [4] [25].

Landscape of Single-Cell Foundation Models

Several scFMs form the backbone of current zero-shot prediction research. The table below summarizes key models and their relevance to drug response tasks.

Table 1: Foundational Models for Single-Cell Analysis

| Model Name | Key Architectural Features | Pretraining Corpus | Demonstrated Relevance to Drug Response |
| --- | --- | --- | --- |
| scGPT [26] [17] | Transformer-based; utilizes masked gene modeling. | Over 33 million non-cancerous human cells. | Robust performance across diverse tasks including perturbation prediction; can be fine-tuned for drug response. |
| Geneformer [4] [2] | Transformer-based; uses rank-based gene tokenization. | ~30 million single-cell transcriptomes from various tissues. | Used for predicting disease-associated network dynamics and perturbation effects. |
| Nicheformer [27] | Transformer-based; integrates dissociated and spatial transcriptomics. | 110 million cells (57M dissociated, 53M spatial). | Captures spatial context, enabling predictions about the tissue microenvironment's role in drug response. |
| PharmaFormer [28] | Custom Transformer; integrates gene expression and drug SMILES structures. | GDSC database (900+ cell lines, 100+ drugs). | Specifically designed for clinical drug response prediction via transfer learning from cell lines to organoids. |

Application Notes: Protocols for Zero-Shot Prediction

This section provides a detailed, step-by-step protocol for leveraging scFMs to predict cellular responses to novel drugs in a zero-shot setting.

Protocol 1: Zero-Shot Cell Embedding for Response Stratification

Objective: To identify subpopulations of cells within a tumor that may exhibit innate sensitivity or resistance to a novel drug based solely on their pre-treatment transcriptomic state.

Materials:

  • Input Data: Pre-treatment scRNA-seq count matrix from a patient-derived sample.
  • Foundation Model: A pretrained scFM (e.g., scGPT, Geneformer) with published weights.
  • Computational Environment: High-performance computing cluster with GPU acceleration and Python environment (e.g., PyTorch, JAX).
  • Software Tools: Unified frameworks like BioLLM [17] can streamline model access and standardize APIs.

Methodology:

  • Data Preprocessing: Prepare your query scRNA-seq data. This includes standard quality control (filtering low-quality cells and genes), normalization, and log-transformation. Ensure the gene identifiers align with the vocabulary used during the scFM's pretraining.
  • Zero-Shot Embedding Generation: Pass the preprocessed single-cell data through the frozen, pretrained scFM to generate cell embeddings. This step is crucial and must be performed without any fine-tuning of the model on the new data.

  • Dimensionality Reduction and Clustering: Apply techniques like UMAP or t-SNE to the high-dimensional cell embeddings for visualization. Subsequently, use clustering algorithms (e.g., Leiden, Louvain) to identify distinct cell subpopulations.
  • Interpretation and Hypothesis Generation: Analyze the resulting clusters. Cells clustering together in the embedding space share similar biological states learned by the foundation model. Correlate these states with known markers of drug sensitivity or resistance. For instance, a cluster enriched for oxidative phosphorylation may suggest sensitivity to metabolic inhibitors, while a cluster with high expression of ABC transporters may indicate potential for multidrug resistance [29] [2].
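The marker-correlation step can be made concrete by scoring each cell against a sensitivity or resistance gene set and comparing scores across clusters. The sketch below is a simplified stand-in for scanpy's `sc.tl.score_genes` (which additionally subtracts a control-set mean); the toy matrix and cluster assignments are illustrative, while ABCB1 and ABCC1 are real ABC transporter genes of the kind mentioned above.

```python
import numpy as np

def score_gene_set(expr, gene_names, gene_set):
    """Mean expression of a marker gene set per cell. Simplified stand-in
    for scanpy's sc.tl.score_genes, which also subtracts a control-set mean."""
    idx = [gene_names.index(g) for g in gene_set if g in gene_names]
    return expr[:, idx].mean(axis=1)

genes = ["ABCB1", "ABCC1", "GAPDH", "ACTB"]
rng = np.random.default_rng(0)
expr = rng.gamma(2.0, 1.0, size=(6, 4))   # toy log-normalized matrix
clusters = np.array([0, 0, 0, 1, 1, 1])   # toy cluster labels from embeddings
expr[clusters == 1, :2] += 3.0            # cluster 1 over-expresses the markers

scores = score_gene_set(expr, genes, ["ABCB1", "ABCC1"])
for c in (0, 1):
    print(f"cluster {c}: mean resistance score = {scores[clusters == c].mean():.2f}")
```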

Protocol 2: In-silico Perturbation with Novel Drug Signatures

Objective: To simulate the transcriptional effect of a novel drug on a cell population and predict the outcome.

Materials:

  • Input Data: As in Protocol 1.
  • Foundation Model: A model like scGPT or Geneformer, known for its perturbation prediction capabilities [26].
  • Drug Signature: A representative gene expression signature for the novel drug. This can be derived from public databases (e.g., LINCS L1000) or from bulk RNA-seq experiments on model systems treated with the drug.

Methodology:

  • Define the Perturbation Vector: The novel drug's effect is represented as a "perturbation vector" in the model's latent or input space. This vector encodes the directional change in gene expression that the drug typically induces.
  • In-silico Perturbation: For each cell in your query dataset, the model computationally applies the perturbation vector to its original state, generating a "predicted post-treatment" embedding.

  • Trajectory Analysis: Compare the original and predicted post-treatment embeddings for each cell. Tools like UMAP can visualize the "trajectory" a cell is predicted to take upon treatment. Cells that show a large shift in embedding space are predicted to be strongly affected by the drug.
  • Outcome Prediction: The model can be tasked with predicting a specific outcome, such as cell viability or apoptosis. The distance a cell travels in the embedding space or the direction of its trajectory can be quantified and used to score its predicted sensitivity (large change) or resistance (minimal change) [29] [27].
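The trajectory-scoring logic can be illustrated schematically. The sketch below is a toy model, not how an scFM actually applies perturbations (real models predict the post-treatment state through the network itself); here the per-cell response is assumed, for illustration, to scale with how strongly a cell's baseline embedding aligns with the drug's axis of action.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 32))      # baseline cell embeddings (toy)
drug_vec = rng.normal(size=32)        # toy perturbation vector in latent space
drug_vec /= np.linalg.norm(drug_vec)

# Toy assumption: responsiveness = alignment of the baseline state with
# the drug's direction of action, clipped so unresponsive cells stay put
responsiveness = np.clip(emb @ drug_vec, 0, None)
predicted = emb - responsiveness[:, None] * drug_vec  # predicted post-treatment state

# Sensitivity score = magnitude of the predicted shift in embedding space
shift = np.linalg.norm(predicted - emb, axis=1)
sensitive = shift > np.median(shift)
print(f"{sensitive.sum()} of {len(shift)} cells predicted sensitive")
```

A median split is a deliberately crude threshold; in practice the shift distribution would be compared against validated sensitivity labels or viability assays.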

The Scientist's Toolkit

Table 2: Essential Research Reagents and Computational Tools

| Item / Resource | Function / Description | Example Sources / Tools |
| --- | --- | --- |
| Pretrained Foundation Models | Provides the core AI for generating zero-shot predictions. | scGPT, Geneformer, Nicheformer, scFoundation [26] [4] [27] |
| Unified Software Framework | Standardizes access to different models, enabling fair benchmarking and streamlined workflows. | BioLLM [17] |
| Single-Cell Datasets | Provides the input data for prediction; requires high-quality, annotated pre- and post-treatment data for validation. | CCLE, GDSC, patient-derived organoid data [29] [28] |
| Batch Integration Tools | Corrects for technical variation between datasets, a critical step for robust model application. | Harmony, scVI [4] [2] |
| Gene Ontology Databases | Provides the biological context for interpreting model outputs and identified gene patterns. | Gene Ontology (GO) resources [2] |

Experimental Workflow and Validation

The following diagram illustrates the logical flow of a zero-shot prediction experiment, from data input to biological validation.

[Workflow diagram: Input pre-treatment scRNA-seq data → Data Preprocessing & Quality Control → Foundation Model (zero-shot embedding) → Cluster 1 (potentially sensitive) / Cluster 2 (potentially resistant) → Experimental Validation (e.g., in-vitro assays) → Applications: identify resistance mechanisms, guide combination therapy]

Figure 1: Zero-shot prediction workflow for novel drug response.

Performance Benchmarks and Validation

Rigorous evaluation is essential, as zero-shot performance of scFMs can be variable. Independent benchmarks reveal that while scFMs show promise, they do not always consistently outperform simpler baseline methods like Highly Variable Genes (HVG) selection or specialized models like scVI and Harmony on tasks like cell type clustering and batch correction [4] [2].

Table 3: Example Benchmarking Results for Zero-Shot Cell Embeddings (Adapted from [4] [2])

| Model / Method | AvgBIO Score (Cell Type Clustering) | Batch Integration Score (Pancreas Dataset) | Performance Notes |
| --- | --- | --- | --- |
| HVG (Baseline) | 0.79 | 0.88 | Often outperforms foundation models in zero-shot clustering and integration tasks [4]. |
| scVI (Baseline) | 0.75 | 0.85 | Robust performance on technical batch effects [4]. |
| Harmony (Baseline) | 0.73 | 0.72 | Struggles with complex biological batch effects (e.g., donor variation) [4] [2]. |
| scGPT (Zero-Shot) | 0.68 | 0.78 | Competitive on some tasks but inconsistent; benefits from large-scale pretraining [4] [17]. |
| Geneformer (Zero-Shot) | 0.62 | 0.45 | Underperforms baselines in batch integration; embeddings may be dominated by batch effects [4]. |

Validation requires correlating computational predictions with empirical data. For the ATSDP-NET model (which uses transfer learning, not pure zero-shot), high correlations were found between predicted gene scores and actual outcomes (sensitivity: R=0.888, p<0.001; resistance: R=0.788, p<0.001) [29] [30]. Similarly, PharmaFormer demonstrated clinical relevance by stratifying patients into risk groups with significantly different survival outcomes after fine-tuning on organoid data (e.g., Hazard Ratio for oxaliplatin in colon cancer: 4.49) [28]. These results underscore the potential value of these approaches, even as pure zero-shot capabilities continue to mature.

In single-cell genomics, the emergence of single-cell foundation models (scFMs) pretrained on tens of millions of cells has created new paradigms for biological discovery [1]. These models learn universal representations of cellular states by capturing complex gene-gene interactions and regulatory networks, offering immense potential for downstream tasks like drug response prediction [18] [31]. However, a significant challenge persists: adapting these massive models to specialized tasks with limited labeled data while preserving their generalizable biological knowledge.

Adapter-based fine-tuning has emerged as a powerful solution to this challenge, enabling parameter-efficient adaptation of scFMs. By inserting small, trainable modules into frozen pretrained models, adapters allow specialization for molecular perturbation prediction and other tasks while retaining the rich biological representations learned during pretraining [18] [31] [32]. This approach is particularly valuable for few-shot and zero-shot learning scenarios common in biomedical research, where experimental data for novel drugs or cell lines is extremely limited.

The Adapter Paradigm in Machine Learning

Adapter-based fine-tuning represents a parameter-efficient alternative to full model fine-tuning. Instead of updating all parameters of a pretrained foundation model, this approach inserts small, trainable adapter modules between the model's frozen layers [32]. A canonical adapter employs a bottleneck structure that first down-projects the input dimensionality, applies a non-linear activation, then up-projects back to the original dimension, with a skip connection preserving the original representations: h′ = W_up(σ(W_down h)) + h [32].

This design provides multiple advantages: it dramatically reduces the number of trainable parameters (often to 1-5% of the original model or less), minimizes catastrophic forgetting of pretrained knowledge, enables modular multi-task learning, and significantly reduces storage requirements by sharing the same backbone across tasks [31] [32]. The efficiency of adapters has been demonstrated across domains including natural language processing, computer vision, and speech recognition, where they often match or exceed full fine-tuning performance despite their minimal parameter count [32].
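The bottleneck equation above maps directly to a few lines of code. This NumPy sketch shows the forward pass and the parameter arithmetic; the zero-initialization of W_up is a common (though not universal) convention that makes the adapter an identity function at the start of training.

```python
import numpy as np

def adapter_forward(h, w_down, w_up):
    """Bottleneck adapter forward pass: h' = W_up(ReLU(W_down h)) + h.
    The skip connection means a zero-initialized W_up leaves the frozen
    model's representations untouched at the start of training."""
    z = np.maximum(w_down @ h, 0.0)   # down-project + non-linearity
    return w_up @ z + h               # up-project + residual connection

d_model, d_bottleneck = 512, 64
rng = np.random.default_rng(0)
w_down = rng.normal(scale=0.02, size=(d_bottleneck, d_model))
w_up = np.zeros((d_model, d_bottleneck))  # zero-init: adapter starts as identity

h = rng.normal(size=d_model)
assert np.allclose(adapter_forward(h, w_down, w_up), h)  # identity at init

n_adapter = w_down.size + w_up.size  # 2 * 512 * 64 = 65,536 per adapter
print(f"adapter params: {n_adapter}")
```

For a transformer with hundreds of millions of frozen parameters, a handful of such modules accounts for well under 1% of the total, which is the regime scDCA operates in.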

Adapter Architectures for Single-Cell Foundation Models

scDCA: Drug-Conditional Adapters for Perturbation Prediction

The Single-Cell Drug-Conditional Adapter (scDCA) represents a specialized architecture for molecular perturbation prediction. This approach introduces drug-conditional adapter layers that inject molecular structure information into frozen scFMs while training less than 1% of the original model parameters [18] [31]. The adapter parameters are dynamically conditioned on chemical structures, enabling the model to predict transcriptional responses to novel drugs and generalize zero-shot to unseen cell lines [31].

Table: scDCA Performance on Molecular Perturbation Prediction

| Generalization Task | Performance Improvement | Key Achievement |
| --- | --- | --- |
| Novel Drug Prediction | State-of-the-art results | Significant improvement over existing baselines |
| Unseen Cell Line Prediction | Major improvements | Successful zero-shot generalization |
| Few-shot Scenarios | Strong performance | Effective with limited training data |

Attn-Adapter: Dual Attention Mechanism

The Attn-Adapter architecture employs a dual attention mechanism to enhance few-shot learning capabilities. It consists of two key components: a Memory Attn-Adapter that refines category embeddings using support examples through cross-attention, and a Local-Global Attn-Adapter that enriches image embeddings by integrating local and global features [33]. This design enables dynamic adaptation from a few labeled samples without retraining the base model, outperforming state-of-the-art methods in cross-category and cross-dataset generalization [33].

Experimental Protocols for Adapter Implementation

Protocol: Implementing scDCA for Drug Response Prediction

Objective: Adapt a single-cell foundation model (e.g., scGPT) to predict transcriptional responses to novel drugs using drug-conditional adapters.

Materials:

  • Pretrained scFM (e.g., scGPT with 50M parameters pretrained on 33M cells)
  • Chemical perturbation dataset (e.g., with 100+ molecules across multiple cell lines)
  • Adapter framework (PyTorch or TensorFlow)
  • GPU resources (recommended: 16GB+ VRAM)

Procedure:

  • Model Preparation: Load a pretrained scFM and freeze all its parameters.
  • Adapter Insertion: Insert drug-conditional adapter layers after transformer blocks. Each adapter should implement:
    • Down-projection to reduced dimension (e.g., 64D from original 512D)
    • Non-linear activation (ReLU)
    • Up-projection to original dimension
    • Skip connection
  • Drug Conditioning: Implement a molecular structure encoder (e.g., using graph neural networks or molecular fingerprints) to generate conditional parameters for the adapter layers.
  • Training: Train only adapter parameters using:
    • Objective: Mean squared error between predicted and actual gene expression
    • Batch size: 32-128 (adjust based on GPU memory)
    • Learning rate: 1e-4 to 1e-3 with linear decay
    • Epochs: 50-100 with early stopping
  • Evaluation: Assess performance on held-out drugs and cell lines using metrics like mean squared error, Pearson correlation, and zero-shot accuracy.

Expected Outcomes: The adapted model should achieve state-of-the-art performance in predicting cellular responses to novel drugs and demonstrate zero-shot generalization to unseen cell lines, outperforming methods like ChemCPA and Biolord [31].
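The drug-conditioning step in the procedure above can be sketched as follows. This is a deliberately simplified, FiLM-style gating of the adapter bottleneck by a fingerprint-derived vector, not the published scDCA architecture; all dimensions, the mock fingerprints, and the gating scheme are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
d_model, d_bottle, d_drug = 512, 64, 128      # illustrative sizes

# Trainable adapter weights (the scFM backbone itself stays frozen).
w_down = rng.normal(scale=0.02, size=(d_model, d_bottle))
w_up = rng.normal(scale=0.01, size=(d_bottle, d_model))
w_gate = rng.normal(scale=0.1, size=(d_drug, d_bottle))  # drug-conditioning head

def drug_conditional_adapter(h, drug_fp):
    """Adapter whose bottleneck is gated by a drug-fingerprint-derived vector,
    so the same frozen backbone produces drug-specific responses."""
    gate = np.tanh(drug_fp @ w_gate)          # drug-specific modulation vector
    z = np.maximum(h @ w_down, 0.0) * gate    # condition the bottleneck on the drug
    return z @ w_up + h                       # skip connection

h = rng.normal(size=(16, d_model))            # hidden states from the frozen scFM
fp_a = rng.integers(0, 2, size=d_drug).astype(float)  # mock molecular fingerprints
fp_b = rng.integers(0, 2, size=d_drug).astype(float)

out_a = drug_conditional_adapter(h, fp_a)
out_b = drug_conditional_adapter(h, fp_b)     # different drug -> different output

# Trainable parameters stay tiny relative to a ~50M-parameter backbone (<1%).
n_trainable = w_down.size + w_up.size + w_gate.size
```

In the real setting the fingerprint would come from a molecular encoder (graph neural network or fingerprint hash), and only these adapter tensors would receive gradients.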

Protocol: Few-Shot Adaptation with Attn-Adapter

Objective: Adapt a vision-language model for few-shot classification in biological imaging contexts.

Materials:

  • Pretrained VLM (e.g., CLIP)
  • Few-shot support set (typically 1-16 samples per class)
  • Attn-Adapter implementation

Procedure:

  • Feature Extraction: Extract support embeddings and category embeddings using the frozen base model.
  • Memory Attn-Adapter: Apply cross-attention to refine category embeddings using support embeddings as keys and values.
  • Local-Global Attn-Adapter: Enhance image embeddings by integrating local and global features through attention mechanisms.
  • Similarity Computation: Calculate cosine similarity between refined category and image embeddings for classification.

Validation: Test cross-category and cross-dataset generalization, comparing against Tip-Adapter and Meta-Adapter baselines [33].
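The Memory Attn-Adapter's refinement step (category embeddings as queries, support embeddings as keys and values) can be sketched with plain NumPy attention. The dimensions, the residual connection, and the single-head formulation are illustrative assumptions rather than the published architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_categories(cat_emb, support_emb, scale=None):
    """Cross-attention: categories query the few-shot support set (keys/values).
    The residual keeps the zero-shot category prior intact."""
    d = cat_emb.shape[-1]
    if scale is None:
        scale = 1.0 / np.sqrt(d)
    attn = softmax(cat_emb @ support_emb.T * scale)   # (n_classes, n_support)
    return attn @ support_emb + cat_emb               # refined category embeddings

rng = np.random.default_rng(1)
n_classes, n_support, d = 5, 16, 64
cat = rng.normal(size=(n_classes, d))                 # frozen category embeddings
support = rng.normal(size=(n_support, d))             # frozen support embeddings

refined = refine_categories(cat, support)

# Classification: cosine similarity between refined categories and a query embedding.
query = rng.normal(size=(d,))
cos = (refined @ query) / (np.linalg.norm(refined, axis=1) * np.linalg.norm(query))
pred = int(np.argmax(cos))
```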

Performance Evaluation and Benchmarking

Quantitative Performance of Adapter Methods

Table: Adapter Performance Across Domains

| Domain | Parameter Efficiency | Performance vs. Full Fine-tuning | Key Applications |
| --- | --- | --- | --- |
| Natural Language Processing | 0.6-6% of parameters | Outperforms by 0.7-2.5% in low-resource settings | Sentiment analysis, QA, NLI |
| Computer Vision | 2-5% of parameters | Exceeds by 1% AP on instance segmentation | Object detection, classification |
| Speech Translation | ~7% of parameters | BLEU improvements of +1.1 on low-resource pairs | Multi-speaker adaptation |
| Single-Cell Biology | <1% of parameters | State-of-the-art in perturbation prediction | Drug response, novel cell line generalization |

Adapter-based approaches consistently demonstrate competitive performance while maintaining parameter efficiency. In single-cell biology, scDCA enables significant improvements in few-shot and zero-shot generalization to new cell lines compared to existing baselines [18] [31]. The method establishes new state-of-the-art results across generalization tasks, particularly for the challenging scenario of predicting perturbations for unseen cell lines.

Zero-Shot Capabilities of Adapted Models

Rigorous evaluation of zero-shot performance is crucial for assessing true generalization capabilities. Studies reveal that scFMs like scGPT and Geneformer face challenges in zero-shot settings, sometimes underperforming simpler methods like highly variable gene selection on tasks like cell type clustering and batch integration [4]. However, adapter-based fine-tuning significantly enhances zero-shot capabilities by preserving the model's foundational knowledge while enabling adaptation to novel concepts [18] [31].

Benchmarking studies show that while no single scFM consistently outperforms others across all tasks, models with adapter-based tuning demonstrate more robust generalization [16]. Comprehensive evaluations across multiple cell-level tasks reveal that adapter-enhanced models capture biological relationships more effectively, as measured by ontology-informed metrics like scGraph-OntoRWR [16].

The Scientist's Toolkit

Table: Essential Research Reagents for Adapter Implementation

| Reagent / Tool | Function | Example Implementation |
| --- | --- | --- |
| Single-Cell Foundation Models | Provides pretrained biological representations | scGPT (50M params, pretrained on 33M cells) [1] |
| Adapter Modules | Enables parameter-efficient fine-tuning | Bottleneck layers with down/up-projection [32] |
| Molecular Encoders | Bridges chemical and biological modalities | Graph neural networks for molecular structures [31] |
| Few-Shot Support Sets | Provides limited labeled examples | 1-16 samples per class for adaptation [33] |
| Benchmark Datasets | Evaluates generalization capabilities | Chemical perturbation data with novel drugs/cell lines [18] |
| Unified Frameworks | Standardizes model integration and evaluation | BioLLM for consistent API access to multiple scFMs [17] |

Visualizing Experimental Workflows

Workflow: single-cell expression input → pretrained scFM (frozen weights, feature extraction) → drug-conditional adapter layers → predicted transcriptional response; the molecular structure input conditions the adapter parameters.

Diagram 1: scDCA workflow showing how drug information conditions adapter parameters to predict transcriptional responses using a frozen single-cell foundation model.

Workflow: few-shot support samples and category embeddings → Memory Attn-Adapter → refined category embeddings; Local-Global Attn-Adapter → enhanced image embeddings; both feed the few-shot prediction.

Diagram 2: Attn-Adapter architecture demonstrating how dual attention mechanisms refine both category and image embeddings for few-shot learning.

Adapter-based fine-tuning represents a transformative approach for adapting single-cell foundation models to specialized tasks with limited data. The strategic insertion of minimal trainable parameters enables remarkable efficiency while preserving valuable biological knowledge acquired during pretraining. As the field advances, innovations in dynamic routing, conditional adaptation, and hierarchical designs will further enhance the capabilities of adapter-based methods. For researchers in drug discovery and cellular biology, these techniques offer powerful tools to leverage the full potential of foundation models while accommodating the data constraints inherent in biomedical research.

The advent of single-cell genomics has revolutionized our ability to investigate biological systems at unprecedented resolution, revealing profound cellular heterogeneity in development, physiology, and disease. While single-cell RNA sequencing (scRNA-seq) has been the workhorse of this revolution, biological systems operate through complex, multilayered regulatory mechanisms that span multiple molecular modalities and are spatially organized within tissues. The emergence of single-cell multi-omics technologies now enables the simultaneous profiling of different data modalities—including transcriptomics, epigenomics, proteomics, and spatial context—within the same cell, providing a more comprehensive picture of cellular identity and function.

Concurrently, single-cell foundation models (scFMs) have emerged as powerful computational frameworks capable of learning universal representations from massive-scale single-cell data. These models, typically built on transformer architectures and pretrained on millions of cells through self-supervised objectives, have demonstrated remarkable capabilities in adapting to various downstream tasks with minimal fine-tuning. However, a significant challenge remains: most existing scFMs have primarily focused on transcriptomic data alone, limiting their ability to capture the full complexity of biological systems.

This application note explores cutting-edge computational strategies for integrating multi-omic and spatial data modalities within the framework of zero-shot learning single-cell foundation models. We provide detailed protocols and analytical frameworks that enable researchers to move beyond transcriptomics and leverage the full potential of multimodal single-cell data, with particular emphasis on clinical and drug development applications.

Foundations of Single-Cell Foundation Models

Core Architectural Principles

Single-cell foundation models are large-scale deep learning models pretrained on vast datasets that can be adapted to a wide range of downstream tasks through self-supervised learning [1]. These models share three key components that enable their generalization capabilities:

  • Large-scale pretraining: scFMs are trained on extremely large and diverse datasets to capture universal biological patterns. Public archives such as CZ CELLxGENE provide unified access to annotated single-cell datasets with over 100 million unique cells standardized for analysis [1].

  • Transformer architectures: Most scFMs utilize transformer architectures with attention mechanisms that allow the model to learn and weight relationships between any pair of input tokens (typically genes or genomic features) [1]. These architectures can be encoder-based (e.g., BERT-like), decoder-based (e.g., GPT-like), or hybrid designs.

  • Adaptation mechanisms: scFMs can be fine-tuned or prompted for new tasks, transferring learned knowledge to improve performance on target tasks with relatively few additional labeled examples [1].

Tokenization Strategies for Multimodal Data

A critical innovation in extending scFMs beyond transcriptomics lies in developing effective tokenization strategies for representing diverse data types. Unlike natural language, omics data lacks inherent sequential ordering, requiring specialized approaches:

Table 1: Tokenization Strategies for Multi-omic Data

| Data Modality | Tokenization Approach | Special Considerations | Example Models |
| --- | --- | --- | --- |
| scRNA-seq | Genes as tokens ordered by expression level; value embeddings for expression | Non-sequential nature of genes; high sparsity | scGPT, Geneformer |
| scATAC-seq | Chromatin accessibility peaks as tokens; accessibility scores as values | High dimensionality; binary nature | scGPT, MultiVI |
| Spatial Transcriptomics | Spatial coordinates as positional encodings; gene expression tokens | Spatial neighborhood relationships | Nicheformer, stClinic |
| Protein Abundance | Surface proteins as tokens; abundance levels as values | Limited feature space (typically <200 proteins) | CITE-seq models |
| Multiome | Modality-specific tokens with modality indicators | Integration of simultaneous measurements | scPairing, scGPT |

For multimodal integration, researchers have introduced special tokens indicating modality, species, technology, and batch information, enabling the model to learn both shared and modality-specific representations [1] [27]. Positional encoding schemes are adapted to represent the relative order or rank of each feature within a cell.
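A minimal sketch of rank-based tokenization in the spirit of the expression-ordered schemes in Table 1, with a special modality token prepended. The gene symbols, the `<RNA>` token, and the sequence length are illustrative assumptions:

```python
import numpy as np

def tokenize_cell(expr, gene_ids, modality_token, max_len=8):
    """Rank-value tokenization: order genes by descending expression, keep the
    top max_len, and prepend a special modality token."""
    nonzero = np.flatnonzero(expr)                  # sparsity: skip zero counts
    order = nonzero[np.argsort(-expr[nonzero])]     # rank genes by expression
    return [modality_token] + [gene_ids[i] for i in order[:max_len]]

genes = ["CD3D", "MS4A1", "NKG7", "LYZ", "GNLY"]
expr = np.array([0.0, 5.2, 1.1, 9.8, 0.0])

tokens = tokenize_cell(expr, genes, "<RNA>", max_len=3)
# Highest-expressed genes come first: ['<RNA>', 'LYZ', 'MS4A1', 'NKG7']
```

The same pattern extends to other modalities by swapping the feature vocabulary (peaks, proteins) and the modality token.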

Protocols for Multi-omic Data Integration

Protocol 1: Cross-modal Alignment with scPairing

Principle: scPairing integrates separate unimodal datasets to generate artificial multiomics data through contrastive learning in a shared embedding space, addressing the scarcity of true multiomics data [34].

Experimental Workflow:

  • Input Data Preparation:

    • Collect unimodal datasets (e.g., scRNA-seq and scATAC-seq) from the same biological system
    • Perform standard preprocessing: quality control, normalization, and feature selection for each modality
    • Identify anchor features (e.g., genes linked to chromatin accessibility peaks) for cross-modal alignment
  • Model Configuration:

    • Initialize scPairing architecture with modality-specific encoders
    • Configure contrastive learning objective to maximize similarity between embeddings of matched cellular states across modalities
    • Set hyperparameters: embedding dimension (typically 512-1024), batch size, and temperature parameter for contrastive loss
  • Training Procedure:

    • Train model using alternating optimization between modalities
    • Monitor alignment metrics: canonical correlation analysis (CCA) and mean squared error (MSE) between projected embeddings
    • Apply early stopping based on validation set performance
  • Multi-omics Generation:

    • Project separate unimodal datasets into the shared embedding space
    • Generate paired multiomics profiles by matching cells across modalities based on embedding similarity
    • Validate generated data by comparing with held-out true multiomics data

Applications: scPairing has been successfully applied to generate multiomics data for retina, immune, and renal cells, and can be extended to generate trimodal data [34].

Workflow: scRNA-seq data → RNA encoder; scATAC-seq data → ATAC encoder; contrastive learning aligns both in a shared embedding space, from which artificial multiomics data are generated.

Figure 1: scPairing Cross-modal Alignment Workflow
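The contrastive objective at the center of this workflow can be sketched as a symmetric InfoNCE loss, where matched RNA/ATAC embeddings in a batch are positives and all other pairs are negatives. This is a generic formulation with synthetic embeddings; scPairing's exact loss and temperature may differ:

```python
import numpy as np

def info_nce(z_rna, z_atac, temperature=0.1):
    """Symmetric InfoNCE: matched rows across modalities are positives,
    all other in-batch pairs are negatives."""
    # L2-normalize so dot products are cosine similarities
    z_rna = z_rna / np.linalg.norm(z_rna, axis=1, keepdims=True)
    z_atac = z_atac / np.linalg.norm(z_atac, axis=1, keepdims=True)
    logits = z_rna @ z_atac.T / temperature               # (N, N) similarities
    lp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lp_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    # Diagonal entries are the matched (positive) pairs in both directions
    return -(np.mean(np.diag(lp)) + np.mean(np.diag(lp_t))) / 2

rng = np.random.default_rng(7)
z = rng.normal(size=(32, 128))
aligned = info_nce(z, z)                  # perfectly matched embeddings
shuffled = info_nce(z, z[::-1].copy())    # mismatched pairings
# Aligned pairs yield a much lower loss than mismatched ones
```

Minimizing this loss pulls matched cellular states together across modalities while pushing apart mismatched ones, which is what makes nearest-neighbor pairing in the shared space meaningful.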

Protocol 2: Zero-shot Multimodal Cell Typing with scGPT

Principle: scGPT leverages large-scale pretraining on over 33 million cells to enable zero-shot cell type annotation across multiple modalities without task-specific fine-tuning [26].

Experimental Workflow:

  • Data Preprocessing:

    • For each modality, format data as gene-protein feature matrices
    • Apply scGPT's standardized normalization: log(1+CP10K) for RNA, arcsinh(5×) for ADT data
    • Handle missing features through scGPT's built-in imputation or zero-padding
  • Model Initialization:

    • Load pretrained scGPT model weights (available through BioLLM framework)
    • Configure model for multimodal input using modality-specific tokenization
    • Set context length to accommodate combined feature set (typically 1200-1500 tokens)
  • Embedding Extraction:

    • Forward pass multimodal data through frozen pretrained model
    • Extract cell embeddings from the [CLS] token or mean pooling of last hidden layer
    • Reduce dimensionality using UMAP or t-SNE for visualization
  • Zero-shot Classification:

    • Compute cosine similarity between query cell embeddings and reference cell type centroids
    • Assign cell types based on nearest neighbors in embedding space
    • Apply confidence thresholds based on distance to nearest centroid

Validation Metrics: Report accuracy, F1-score, and confusion matrix for cell type annotation, and use Local Inverse Simpson's Index (LISI) to assess integration quality [2].
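Steps 1 and 4 of this protocol, log(1+CP10K) normalization and nearest-centroid zero-shot assignment, can be sketched as follows. The embeddings are mocked stand-ins for frozen scGPT outputs, and the confidence threshold is an illustrative assumption:

```python
import numpy as np

def log_cp10k(counts):
    """log(1 + CP10K): scale each cell to 10,000 total counts, then log-transform."""
    cp10k = counts / counts.sum(axis=1, keepdims=True) * 1e4
    return np.log1p(cp10k)

def nearest_centroid(query_emb, centroids, labels, min_cos=0.0):
    """Assign each query cell to the reference cell-type centroid with the
    highest cosine similarity; low-confidence cells become 'unassigned'."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    cos = q @ c.T
    best = cos.argmax(axis=1)
    return [labels[b] if cos[i, b] >= min_cos else "unassigned"
            for i, b in enumerate(best)]

rng = np.random.default_rng(3)
counts = rng.poisson(2.0, size=(4, 100)).astype(float) + 1.0  # mock count matrix
norm = log_cp10k(counts)
assert np.allclose(np.expm1(norm).sum(axis=1), 1e4)  # each cell sums to 10,000

# Mock embeddings standing in for frozen scGPT [CLS] outputs
centroids = np.eye(3, 16)                                      # 3 reference types
query = np.vstack([centroids[1] + 0.05, centroids[2] + 0.05])  # near types 1 and 2
calls = nearest_centroid(query, centroids, ["T cell", "B cell", "NK cell"])
# → ['B cell', 'NK cell']
```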

Protocols for Spatial Data Integration

Protocol 3: Spatial Context Transfer with Nicheformer

Principle: Nicheformer is a transformer-based foundation model pretrained on both dissociated single-cell and spatial transcriptomics data (SpatialCorpus-110M) that captures spatial context and enables spatial information transfer to dissociated data [27].

Experimental Workflow:

  • Data Curation:

    • Collect spatial transcriptomics data (MERFISH, Xenium, CosMx, or ISS technologies)
    • Process dissociated scRNA-seq data from comparable biological systems
    • Map orthologous genes across species (human and mouse) for cross-species applications
  • Model Pretraining (Optional):

    • Initialize transformer architecture with 12 encoder layers, 16 attention heads
    • Implement rank-based tokenization with technology-specific normalization
    • Train with masked gene modeling objective on spatial and dissociated data jointly
    • Incorporate contextual tokens for species, modality, and technology
  • Spatial Tasks:

    • Spatial composition prediction: Predict local cellular density and cell-type composition around each cell
    • Spatial label prediction: Transfer spatially-defined annotations (e.g., niche labels) to dissociated cells
    • Linear probing: Train simple classifiers on frozen Nicheformer embeddings for spatial tasks
  • Validation:

    • Compare spatial context predictions with ground truth manual annotations
    • Assess model uncertainty using confidence calibration metrics
    • Validate biological insights through comparison with known spatial patterns

Key Innovation: Nicheformer demonstrates that models trained only on dissociated data fail to recover the complexity of spatial microenvironments, underscoring the necessity of multiscale integration [27].

Table 2: Performance Comparison of Spatial Foundation Models

| Model | Training Data | Spatial Composition Prediction (Accuracy) | Spatial Label Transfer (F1) | Compute Requirements |
| --- | --- | --- | --- | --- |
| Nicheformer | 57M dissociated + 53M spatial cells | 0.89 | 0.85 | High (49.3M parameters) |
| CellPLM | 9M dissociated + 2M spatial cells | 0.76 | 0.72 | Medium |
| Geneformer | Dissociated only | 0.62 | 0.58 | Medium |
| scGPT | Dissociated only | 0.65 | 0.61 | High |

Protocol 4: Clinically Relevant Niche Analysis with stClinic

Principle: stClinic integrates spatial multi-slice multi-omics (SMSMO) and clinical data through dynamic graph modeling to identify clinically relevant cellular niches and their association with patient outcomes [35].

Experimental Workflow:

  • Data Integration:

    • Collect SMSMO data from multiple tissue slices (transcriptomics, epigenomics, proteomics)
    • Incorporate clinical metadata: survival time, treatment response, disease stage
    • Preprocess using MultiVI or Seurat for initial feature extraction
  • Graph Construction:

    • Build spatial neighborhood graphs within each slice using k-nearest neighbors (k=15)
    • Construct cross-slice similarity graphs based on feature profiles
    • Create unified graph combining spatial and feature similarities
  • Model Training:

    • Initialize stClinic with variational graph attention encoder (VGAE)
    • Train with Mixture-of-Gaussian (MOG) prior on latent features
    • Implement iterative graph refinement removing links between dissimilar nodes
    • Incorporate attention mechanisms to weight important niches
  • Clinical Association:

    • Represent each slice using niche vectors with six geometric statistical measures
    • Train supervised models to predict clinical outcomes from niche representations
    • Identify significant niches enriched in specific clinical groups
  • Zero-shot Transfer:

    • Use trained encoder to map new samples into shared feature space
    • Transfer niche labels from reference to query datasets without retraining
    • Validate transferred labels using spatial context and marker expression

Applications: stClinic has identified aggressive niches enriched with tumor-associated macrophages and favorable prognostic niches abundant in B and plasma cells across breast cancer, colorectal cancer, and liver metastasis datasets [35].

Workflow: SMSMO data → dynamic graph construction → VGAE encoder with MOG prior → latent features → niche vectors; niche vectors combine with clinical data for clinical outcome prediction, while the latent features also support zero-shot transfer.

Figure 2: stClinic Dynamic Graph Analysis Workflow
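The spatial graph construction step (k-nearest neighbors within a slice, k=15) can be sketched with a brute-force NumPy implementation, adequate for a few thousand cells per slice; for larger slices a KD-tree would be the practical choice. The mock coordinates are illustrative:

```python
import numpy as np

def spatial_knn_graph(coords, k=15):
    """Boolean adjacency matrix connecting each cell to its k nearest
    spatial neighbors (brute-force pairwise distances)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)                 # exclude self-edges
    n = coords.shape[0]
    nearest = np.argpartition(dist, k, axis=1)[:, :k]  # k smallest per row
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = True
    return adj

rng = np.random.default_rng(5)
coords = rng.uniform(0, 100, size=(200, 2))        # mock cell positions on a slice
adj = spatial_knn_graph(coords, k=15)
assert adj.sum(axis=1).min() == 15 and adj.sum(axis=1).max() == 15
```

Note the resulting graph is directed (kNN is not symmetric); symmetrizing with `adj | adj.T` is a common follow-up before feeding a graph encoder.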

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Research Reagent Solutions for Multi-omic Spatial Analysis

| Resource | Type | Function | Access |
| --- | --- | --- | --- |
| CZ CELLxGENE | Data Platform | Provides unified access to >100 million annotated single cells | Public portal |
| SpatialCorpus-110M | Training Data | Curated collection of 57M dissociated + 53M spatial cells for pretraining | Research use |
| BioLLM | Benchmarking Framework | Standardized interface for evaluating >15 foundation models | Open source |
| DISCO | Data Resource | Federated database aggregating single-cell data | Public portal |
| Pathway Tools | Visualization Software | Enables simultaneous visualization of up to 4 omics data types on metabolic charts | Academic license |
| scGPT Weights | Pretrained Model | Foundation model parameters pretrained on 33M+ cells | Research use |
| Nicheformer Code | Model Implementation | Transformer for spatial and dissociated data integration | GitHub repository |
| stClinic Package | Clinical Analysis | Dynamic graph model for SMSMO and clinical data integration | Upon request |

Discussion and Future Perspectives

The integration of multi-omic and spatial data modalities within zero-shot learning foundation models represents a paradigm shift in single-cell computational biology. The protocols outlined in this application note provide actionable frameworks for researchers to leverage these advanced methodologies in their investigations.

Critical challenges remain in several areas. Technical variability across platforms continues to complicate integration, with different technologies exhibiting distinct bias profiles that models must account for [27]. Interpretability of foundation model predictions requires further development, particularly for clinical translation where understanding model reasoning is essential. Computational scalability presents ongoing challenges as dataset sizes continue to grow exponentially.

Future directions should focus on several key areas. First, developing standardized benchmarking frameworks specifically designed for multimodal foundation models will enable more rigorous comparison and selection of appropriate methods for specific applications. Second, creating multimodal knowledge graphs that incorporate prior biological knowledge can enhance model interpretability and biological relevance. Finally, establishing federated learning frameworks will enable model training across distributed datasets while preserving data privacy, particularly important for clinical applications.

The convergence of multimodal single-cell technologies with advanced foundation model architectures promises to unlock new insights into cellular biology and disease mechanisms. By providing detailed protocols and analytical frameworks, this application note aims to equip researchers with the tools necessary to advance beyond transcriptomics and leverage the full potential of integrated multi-omic and spatial data in the era of single-cell foundation models.

The advent of single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity in cancer, presenting new opportunities for precision medicine. However, translating these complex, high-dimensional datasets into actionable therapeutic insights remains a significant challenge. Single-cell foundation models (scFMs), pretrained on millions of cells using self-supervised learning, have emerged as powerful tools for decoding this complexity. These models learn universal biological representations that enable zero-shot learning and transfer across diverse downstream tasks without task-specific retraining [26]. This case study explores the application of scFMs to one of oncology's most pressing challenges: predicting individual patient drug sensitivity from single-cell transcriptomic profiles. By leveraging the emergent properties of foundation models, researchers can now interrogate cellular response mechanisms at unprecedented resolution, potentially accelerating the development of personalized cancer therapies.

Background

The Drug Sensitivity Prediction Challenge

Cancer treatment continues to evolve toward precision medicine, yet effective treatment selection remains hampered by tumor heterogeneity and limited predictive biomarkers. Traditional bulk RNA sequencing masks cellular subpopulations that may drive treatment resistance, while functional drug screening using patient-derived cells faces practical limitations in cost, scalability, and clinical translation [36]. Machine learning approaches have shown promise but often struggle with the high dimensionality, technical noise, and batch effects inherent in single-cell data [2]. The field requires methods that can generalize across datasets, capture subtle biological signals, and provide interpretable predictions for clinical decision-making.

Single-Cell Foundation Models

Foundation models represent a paradigm shift in single-cell data analysis. Originally developed for natural language processing, these models employ transformer-based architectures to learn fundamental biological principles from massive, diverse collections of single-cell data. Through pretraining objectives like masked gene modeling and contrastive learning, scFMs capture hierarchical patterns of gene regulation, cellular states, and biological processes [26]. Notable examples include scGPT (pretrained on over 33 million cells) and Geneformer, which demonstrate remarkable cross-task generalization capabilities including zero-shot cell type annotation and perturbation response prediction [26] [2]. Unlike traditional single-task models, scFMs create a universal representation space that encodes biological knowledge transferable to novel prediction tasks with minimal fine-tuning.

Key scFMs for Drug Sensitivity Prediction

Table 1: Foundation Models for Single-Cell Drug Response Prediction

| Model | Architecture | Pretraining Scale | Key Strengths | Reported Performance |
| --- | --- | --- | --- | --- |
| scGPT | Transformer | 33+ million cells [26] | Zero-shot annotation, multi-omic integration, perturbation modeling [26] | Superior cross-task generalization; robust benchmark performance [26] [2] |
| Geneformer | Transformer | Millions of cells [2] | Contextual gene embeddings, mechanism of action analysis [2] | Captures biologically meaningful relationships; transferable representations [2] |
| scPlantFormer | Phylogenetic transformer | 1 million plant cells [26] | Cross-species integration, lightweight architecture | 92% cross-species annotation accuracy [26] |
| Nicheformer | Graph transformer | 53 million spatial cells [26] | Spatial context modeling, niche environment effects | Spatial context prediction and integration [26] |

Experimental Protocols

Zero-Shot Drug Sensitivity Prediction Workflow

Workflow: single-cell transcriptomics input → scRNA-seq data matrix → scFM embedding (e.g., scGPT) → zero-shot prediction head → drug sensitivity scores.

Protocol 1: Zero-Shot Prediction Using Pretrained scFM Embeddings

  • Input Data Preparation: Process single-cell transcriptomics data (raw or normalized counts) for patient-derived cells or tumor samples. Data should be formatted to match the pretraining corpus gene space of the target scFM [2].

  • Embedding Generation: Extract cell embeddings from the final layer of the pretrained scFM without fine-tuning. For scGPT, this involves forward propagation of the expression matrix through the transformer architecture to obtain contextual cell representations [26] [2].

  • Drug Response Prediction: Apply a zero-shot prediction head to map embeddings to drug sensitivity scores. This can be implemented as:

    • A similarity-based approach comparing query cell embeddings to reference drug response profiles
    • A linear probe trained on limited labeled data while keeping the scFM backbone frozen
    • Direct inference using the model's inherent perturbation modeling capabilities [26]
  • Validation: Evaluate predictions against experimental drug screening data using correlation metrics (Pearson/Spearman R) and classification metrics (AUC-ROC) for binarized sensitivity thresholds [2].
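The linear-probe variant of the prediction head can be sketched as a closed-form ridge regression on frozen embeddings. The embeddings and IC50 values below are synthetic, so the held-out correlation only demonstrates the mechanics, not real predictive power:

```python
import numpy as np

def fit_linear_probe(embeddings, ic50, l2=1.0):
    """Ridge-regression probe on frozen scFM embeddings:
    w = (X^T X + l2*I)^-1 X^T y, with a bias column appended."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ ic50)

def predict(embeddings, w):
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    return X @ w

rng = np.random.default_rng(11)
emb = rng.normal(size=(300, 32))                       # mock frozen scFM embeddings
true_w = rng.normal(size=32)
ic50 = emb @ true_w + rng.normal(scale=0.1, size=300)  # synthetic drug responses

w = fit_linear_probe(emb[:200], ic50[:200])            # train on 200 cells
pred = predict(emb[200:], w)                           # predict 100 held-out cells
r = np.corrcoef(pred, ic50[200:])[0, 1]                # held-out Pearson R
```

Because only the probe is trained, the scFM backbone stays frozen, which keeps this workflow viable even with small labeled drug-response datasets.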

Interpretable Mechanism-of-Action Analysis

Workflow: scFM prediction model → interpretability methods (SHAP analysis, attention mechanisms) → important genes → MOA pathway enrichment.

Protocol 2: Interpretable MOA Analysis with scFMs

  • Feature Importance Calculation: Apply model interpretability techniques to identify genes driving predictions:

    • SHAP Analysis: Compute Shapley values to quantify each gene's contribution to predicted IC50 values [37].
    • Attention Analysis: Extract attention weights from transformer layers to identify biologically relevant gene-gene interactions [26].
  • MOA Pathway Validation: Test whether identified important genes are enriched in known drug mechanism-of-action pathways:

    • Retrieve drug-MOA pathways from Reactome, KEGG, or using LLM-curated annotations [37].
    • Perform gene set enrichment analysis (GSEA) on important genes.
    • Statistically evaluate target recovery rates against background distributions [37].
  • Biological Validation: Correlate model-derived important genes with CRISPR screening data (DepMap) to confirm functional relevance in specific cancer contexts [37].
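The target-recovery test in step 2 can be sketched as a one-sided hypergeometric enrichment test using only the Python standard library; the gene counts are illustrative:

```python
from math import comb

def hypergeom_pval(overlap, set_size, hits_total, universe):
    """One-sided hypergeometric test: P(X >= overlap) that a random gene set of
    size set_size drawn from `universe` genes contains at least `overlap` of the
    hits_total pathway genes."""
    p = 0.0
    for k in range(overlap, min(set_size, hits_total) + 1):
        p += comb(hits_total, k) * comb(universe - hits_total, set_size - k)
    return p / comb(universe, set_size)

# 20 model-important genes, 10 of which fall in a 50-gene MOA pathway,
# against a background of 2,000 genes: strong enrichment.
p_enriched = hypergeom_pval(overlap=10, set_size=20, hits_total=50, universe=2000)
# The expected chance overlap for these sizes is ~0.5 genes, so observing
# just one overlapping gene is unremarkable.
p_chance = hypergeom_pval(overlap=1, set_size=20, hits_total=50, universe=2000)
assert p_enriched < 1e-6 < p_chance
```

In practice one would apply multiple-testing correction across pathways; libraries such as SciPy (`scipy.stats.hypergeom`) provide the same test with better numerics for large universes.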

Performance Benchmarking

Table 2: Benchmarking scFM Performance Across Drug Prediction Tasks

| Task | Dataset | Best Performing scFM | Performance Metrics | Traditional ML Baseline |
| --- | --- | --- | --- | --- |
| Batch Integration | 5 datasets with inter-patient, platform, tissue variations [2] | scGPT (zero-shot) | Improved biological structure preservation | Seurat, Harmony, scVI [2] |
| Cell Type Annotation | Cross-tissue, novel cell types [2] | scPlantFormer | 92% cross-species accuracy [26] | HVG selection + clustering |
| Cancer Cell Identification | 7 cancer types [2] | Ensemble scFMs | High accuracy in tumor microenvironment | Tissue-specific classifiers |
| Drug Sensitivity Prediction | GDSC, PRISM datasets [37] | XGBoost on scFM embeddings | ρ = 0.88-0.89 Pearson correlation [37] | All-genes models (ρ = 0.40 median) [37] |
| Selective Drug Prediction | GDSC subset (active in <20% cell lines) [36] | scFM with random forest | 3.6/10 accurate in top-10 predictions [36] | Simple recommender systems |

Research Reagent Solutions

Table 3: Essential Research Resources for scFM Drug Sensitivity Studies

| Resource Category | Specific Tools/Datasets | Function and Application | Key Features |
| --- | --- | --- | --- |
| Computational Frameworks | scGPT [26], BioLLM [26] | Universal interfaces for benchmarking scFMs | Standardized access to 15+ foundation models |
| Data Repositories | DISCO [26], CZ CELLxGENE [26], GDSC [37], PRISM [37] | Provide pretraining corpora and drug response validation data | 100M+ cells aggregated for federated analysis |
| Alignment Tools | Celligner [37] | Matches cell line to patient transcriptomics | Enables clinical translation of models |
| Interpretability Packages | SHAP [37], integrated attention visualizers [26] | Model interpretation and MOA discovery | Quantifies gene contribution to predictions |
| Clinical Translation Platforms | CellHit pipeline [37] | End-to-end drug prediction framework | Combines scFMs with clinical data alignment |

Implementation Framework

Integrated Clinical Translation Pipeline

Protocol 3: End-to-End Clinical Drug Prediction Using scFMs

  • Data Acquisition and Processing:

    • Obtain single-cell RNA-seq data from patient tumor samples
    • Align with cancer cell line data using tools like Celligner to enable knowledge transfer [37]
    • Perform quality control and normalization compatible with target scFM
  • Model Selection and Inference:

    • Select appropriate scFM based on task complexity and dataset size [2]
    • Generate cell embeddings using zero-shot protocol to preserve biological variation
    • For limited data scenarios, apply lightweight fine-tuning of prediction heads
  • Clinical Validation and Translation:

    • Validate predictions against patient-derived cell culture drug screens where available [36]
    • Apply interpretability analysis to build confidence in predictions
    • Generate ranked therapeutic recommendations with confidence estimates
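The final step of Protocol 3, ranked therapeutic recommendations with confidence estimates, can be sketched as ranking drugs by mean predicted sensitivity across a sample's cells, with bootstrap confidence intervals. The prediction matrix, drug names, and the convention that higher scores mean greater sensitivity are all assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rank_drugs_with_confidence(pred_matrix, drug_names, n_boot=1000):
    """Rank drugs by mean predicted sensitivity across the cells of a
    sample, with bootstrap 95% confidence intervals over cells.
    pred_matrix: (n_cells, n_drugs); higher = more sensitive (assumed)."""
    n_cells, n_drugs = pred_matrix.shape
    means = pred_matrix.mean(axis=0)
    boots = np.empty((n_boot, n_drugs))
    for b in range(n_boot):
        idx = rng.integers(0, n_cells, n_cells)   # resample cells
        boots[b] = pred_matrix[idx].mean(axis=0)
    lo, hi = np.percentile(boots, [2.5, 97.5], axis=0)
    order = np.argsort(means)[::-1]               # most sensitive first
    return [(drug_names[i], means[i], lo[i], hi[i]) for i in order]

# Toy example: 200 cells, 3 hypothetical drugs with different mean scores
preds = rng.normal(loc=[0.2, 0.8, 0.5], scale=0.1, size=(200, 3))
ranking = rank_drugs_with_confidence(preds, ["drugA", "drugB", "drugC"])
print(ranking[0][0])  # drugB ranks first
```

Resampling cells (rather than drugs) captures how much a recommendation depends on tumor heterogeneity within the sample, which is one reasonable choice among several.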

Single-cell foundation models represent a transformative approach for predicting drug sensitivity in cancer research. By leveraging large-scale pretraining and zero-shot learning capabilities, scFMs overcome critical limitations of traditional methods in handling cellular heterogeneity, technical noise, and dataset integration. The protocols and frameworks presented herein provide researchers with practical guidance for implementing these advanced computational methods. As the field evolves, increasing model interpretability, standardization of benchmarks, and tighter integration with functional validation will be essential for translating scFM-based predictions into clinically actionable insights. The emerging paradigm of foundation models in single-cell analysis promises to accelerate personalized oncology by bridging high-resolution molecular profiling with effective therapeutic selection.

Navigating Challenges and Optimizing Single-Cell Foundation Model Performance

Single-cell foundation models (scFMs) represent a paradigm shift in computational biology, leveraging large-scale pretraining on massive single-cell transcriptomic datasets to learn universal representations of cellular biology [1]. These models, built on transformer architectures, are designed to be adaptable to a wide range of downstream tasks with minimal task-specific training, including zero-shot learning where models are applied without any fine-tuning [1] [38]. The promise of scFMs lies in their potential to capture fundamental biological principles that generalize across tissues, species, and experimental conditions.

However, as scFMs move from development to practical application, a growing body of evidence suggests their performance in zero-shot settings frequently fails to exceed that of simpler, established computational methods [5] [39] [38]. This application note synthesizes recent benchmarking studies to identify specific scenarios where this performance gap occurs, analyzes the underlying causes, and provides standardized protocols for evaluating scFMs against appropriate baselines. Understanding these limitations is crucial for researchers, scientists, and drug development professionals seeking to incorporate scFMs into their analytical workflows while avoiding potential pitfalls.

Quantitative Performance Landscape

Recent comprehensive benchmarking studies reveal that scFMs show inconsistent performance across standard single-cell analysis tasks when compared to traditional computational methods. The table below summarizes key findings from multiple evaluations comparing scFMs against established baselines.

Table 1: Performance Comparison of scFMs vs. Baselines Across Key Tasks

Task Domain Evaluation Metric Top-Performing Methods scFM Performance Key Findings
Cell Type Clustering Average BIO (AvgBIO) score, Average Silhouette Width (ASW) HVG selection, scVI, Harmony [38] Geneformer and scGPT underperform HVG and established methods across most datasets [38] HVG selection consistently outperforms both Geneformer and scGPT across all metrics [38]
Batch Integration Batch mixing scores, Principal Component Regression (PCR) HVG selection, scVI, Harmony [38] Geneformer consistently ranks last; scGPT shows variable performance [38] Best batch integration scores for all datasets achieved by selecting HVGs [38]
Perturbation Effect Prediction Multiple accuracy metrics Simple baseline models [39] scFM embeddings do not provide consistent improvements over baselines, especially under distribution shift [39] All models struggle with predicting strong or atypical perturbation effects [39]
Gene-Level Tasks Tissue specificity, GO term prediction Geneformer, scFoundation [17] scGPT shows robust performance across tasks; scBERT lags due to smaller size and limited training data [17] Performance varies significantly across models and tasks with no single scFM consistently dominating [2] [17]

Benchmarking analysis indicates that the relationship between pretraining dataset size and model performance is not straightforward. While pretraining generally provides benefits over randomly initialized models, extremely large and diverse pretraining datasets do not necessarily confer additional advantages for specific downstream tasks [38]. In some cases, models pretrained on tissue-specific data (e.g., scGPT-blood) outperform models trained on more diverse datasets (e.g., scGPT-human) even for tasks involving other tissue types [38].

Experimental Protocols for scFM Evaluation

Protocol 1: Zero-Shot Cell Type Clustering Benchmark

Purpose: To evaluate the quality of scFM-derived cell embeddings for distinguishing known cell types without task-specific fine-tuning.

Materials:

  • Test Datasets: Curated scRNA-seq datasets with high-quality cell type annotations (e.g., Tabula Sapiens, Pancreas, PBMC datasets) [38]
  • Benchmarking Models: scFMs (Geneformer, scGPT, scFoundation, etc.) and baseline methods (HVG selection, scVI, Harmony)
  • Evaluation Metrics: Average BIO (AvgBIO) score, Average Silhouette Width (ASW) [38]

Procedure:

  • Data Preparation: Standardize test datasets using consistent quality control, normalization, and filtering procedures
  • Embedding Generation: Extract zero-shot cell embeddings from each scFM using the authors' recommended protocols
  • Baseline Generation: Apply traditional methods (HVG selection, scVI, Harmony) to the same test datasets
  • Dimensionality Reduction: Apply UMAP or t-SNE to all embedding types for visualization
  • Cluster Validation: Calculate evaluation metrics by comparing cluster assignments with ground-truth cell type labels
  • Statistical Analysis: Perform multiple comparative tests across datasets and methods

Expected Outcomes: Simpler methods like HVG selection are expected to outperform or match scFMs in most cell type clustering tasks, providing a critical baseline for evaluating the added value of scFM embeddings [38].
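The cluster validation step of Protocol 1 can be illustrated with the Average Silhouette Width on ground-truth labels. The rescaling to [0, 1] follows the scIB-style convention; the synthetic embeddings below are stand-ins, not real scFM or HVG outputs.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

def cell_type_asw(embeddings, labels):
    """Average Silhouette Width of cell-type labels in an embedding,
    rescaled from [-1, 1] to [0, 1] (higher = better separation)."""
    return (silhouette_score(embeddings, labels) + 1) / 2

# Synthetic comparison: a well-separated embedding (stand-in for a strong
# baseline such as HVG+PCA) vs. an uninformative random embedding
labels = np.repeat([0, 1, 2], 100)
separated = rng.normal(size=(300, 16)) + labels[:, None] * 5.0
random_emb = rng.normal(size=(300, 16))
print(cell_type_asw(separated, labels) > cell_type_asw(random_emb, labels))  # True
```

In a real benchmark the same function would be applied to each method's embedding of the same cells, with the ground-truth annotations as `labels`.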

Protocol 2: Batch Integration Assessment

Purpose: To assess scFM capability to remove technical batch effects while preserving biological variation in zero-shot settings.

Materials:

  • Test Datasets: Datasets with known batch effects from multiple sources (e.g., Pancreas benchmark with data from five different sources) [38]
  • Evaluation Metrics: Batch integration scores, PCR, proportion of variance explained by batch effects [38]

Procedure:

  • Dataset Selection: Curate datasets with mixed technical (protocol, platform) and biological (donor, condition) batch effects
  • Embedding Extraction: Generate zero-shot cell embeddings using target scFMs
  • Visualization: Create 2D embeddings colored by batch and cell type identity
  • Quantitative Assessment: Calculate batch mixing metrics comparing within-batch versus between-batch cell distances
  • Biological Preservation: Evaluate whether biological variation remains detectable after batch effect removal
  • Comparative Analysis: Rank methods by their ability to simultaneously minimize batch effects and preserve biological signals

Expected Outcomes: Traditional methods like Harmony and scVI typically outperform scFMs in batch correction, with Geneformer often increasing batch effects compared to raw data [38].
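The quantitative assessment step of Protocol 2 can be sketched with a batch-silhouette mixing score: a silhouette computed on *batch* labels, inverted so that well-mixed batches score near 1. This is one simple metric among the several (kBET, PCR, etc.) used in published benchmarks; the embeddings below are synthetic.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)

def batch_mixing_score(embeddings, batch_labels):
    """1 minus the [0,1]-scaled silhouette on batch labels: near 1 means
    batches are well mixed, near 0 means embeddings separate by batch."""
    asw_batch = (silhouette_score(embeddings, batch_labels) + 1) / 2
    return 1 - asw_batch

batches = np.repeat([0, 1], 150)
mixed = rng.normal(size=(300, 8))            # batches fully overlap
shifted = mixed + batches[:, None] * 6.0     # strong residual batch effect
print(batch_mixing_score(mixed, batches) > batch_mixing_score(shifted, batches))  # True
```

A complete evaluation pairs such a mixing score with a biological-conservation score (as in Protocol 1), since trivially collapsing all cells would also "mix" batches perfectly.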

Protocol 3: Perturbation Prediction Evaluation

Purpose: To evaluate scFM performance in predicting transcriptional responses to genetic perturbations.

Materials:

  • Benchmark Framework: PertEval-scFM standardized evaluation framework [39]
  • Test Data: Large-scale perturbation datasets with transcriptome-wide profiles [40]
  • Evaluation Metrics: Prediction accuracy for direction and magnitude of expression changes

Procedure:

  • Data Splitting: Implement non-standard data splits where no perturbation condition occurs in both training and test sets
  • Model Evaluation: Assess zero-shot scFM embeddings against simpler baseline models
  • Distribution Shift Testing: Evaluate performance under conditions that differ from pretraining data distributions
  • Effect Strength Analysis: Stratify results by perturbation strength and type
  • Comparative Analysis: Rank methods by prediction accuracy across different perturbation classes

Expected Outcomes: scFMs generally fail to consistently outperform simpler baselines for perturbation prediction, particularly for strong or atypical perturbations and under distribution shift [39].
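The data-splitting step of Protocol 3, in which no perturbation condition occurs in both training and test sets, can be implemented with a grouped split, using the perturbation identity as the group key. The toy data below are placeholders for real perturbation profiles.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(3)

# Toy data: 500 cells, each tagged with one of 20 perturbation conditions
perturbations = rng.integers(0, 20, size=500)
X = rng.normal(size=(500, 32))   # stand-in for expression profiles

# Grouped split: every perturbation lands entirely in train OR test,
# so the test set contains only conditions unseen during training
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=perturbations))

train_perts = set(perturbations[train_idx])
test_perts = set(perturbations[test_idx])
print(train_perts & test_perts)  # set() — no overlap
```

A random row-level split would leak each perturbation into both partitions and overstate generalization; the grouped split is what makes the evaluation genuinely zero-shot with respect to perturbation identity.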

Visualizing Evaluation Workflows

[Figure 1 diagram: input data (raw scRNA-seq data, cell type annotations, batch metadata, perturbation datasets) feeds two method families — scFMs (Geneformer, scGPT, etc.) and traditional methods (HVG, scVI, Harmony). Both are evaluated on four tasks: cell type clustering (biological metrics: ASW, AvgBIO), batch effect integration (batch mixing scores, PCR), perturbation prediction (prediction accuracy: MAE, correlation), and gene-level tasks (ontology-based metrics: scGraph-OntoRWR, LCAD). All metrics converge on a performance gap analysis.]

Figure 1: Comprehensive scFM Evaluation Workflow. This workflow outlines the standardized approach for benchmarking single-cell foundation models against traditional methods across key analytical tasks.

Critical Factors Contributing to scFM Underperformance

Architectural and Training Limitations

The transformer architecture, while powerful for sequential data like text, faces fundamental challenges when applied to single-cell data where gene-gene interactions are non-sequential and dynamic [2] [1]. Current scFMs rely on various strategies to impose order on inherently unordered gene expression data, including ranking genes by expression levels or binning expression values [1]. These arbitrary orderings may not capture true biological relationships and can introduce artifacts that limit model generalization.

The masked language model pretraining objective used by most scFMs (Geneformer, scGPT) may not optimally capture the biological information needed for diverse downstream tasks [38]. This pretraining approach focuses on predicting masked genes based on their context, which does not necessarily translate to effective learning of cell-type discriminative features or batch-effect-invariant representations.

Data Quality and Compatibility Issues

Substantial technical variability across single-cell sequencing platforms presents significant challenges for scFMs [26]. Batch effects, technical noise, and platform-specific artifacts in pretraining data can propagate through to model embeddings, reducing their utility for zero-shot applications [1] [26]. Furthermore, the relationship between pretraining data composition and downstream task performance appears complex, with tissue-specific pretraining sometimes outperforming more diverse pretraining even for cross-tissue applications [38].

Data leakage concerns complicate model evaluation, as some test datasets may have been included in scFM pretraining corpora [38]. Surprisingly, even when evaluated on datasets seen during pretraining, scFMs do not consistently outperform simpler methods, indicating potential limitations in how effectively these models extract and retain biologically relevant information during pretraining [38].

Task-Specific Limitations

Current scFMs demonstrate particular weaknesses in batch integration tasks, where Geneformer embeddings sometimes amplify rather than reduce batch effects compared to raw data [38]. This suggests that the pretraining process may not adequately teach models to distinguish technical artifacts from biological signals.

For perturbation prediction, scFMs struggle with strong or atypical perturbation effects and show limited generalization under distribution shift [39]. This indicates that the models may be learning to predict average cellular behaviors rather than capturing the full spectrum of possible cellular responses to perturbations.

Table 2: Key Research Reagents and Computational Resources for scFM Evaluation

Resource Type Primary Function Access Information
BioLLM Framework Software Framework Unified interface for integrating and evaluating diverse scFMs [17] Standardized APIs for model switching and benchmarking
PertEval-scFM Benchmarking Framework Standardized evaluation of perturbation prediction capabilities [39] Specialized framework for perturbation effect prediction
CELLxGENE Census Data Resource Curated single-cell data for pretraining and evaluation [26] [24] >100 million standardized cells for model development
scGPT Foundation Model Generative pretrained transformer for single-cell analysis [26] 33M+ cell pretraining; strong multi-task performance [17]
Geneformer Foundation Model Transformer model pretrained on single-cell transcriptomes [38] Emphasis on gene-level tasks and network inference
Harmony Baseline Method Batch integration and data harmonization [38] Established baseline for integration tasks
scVI Baseline Method Generative model for scRNA-seq analysis [38] Probabilistic modeling of single-cell data
HVG Selection Baseline Method Feature selection based on high variability [38] Surprisingly competitive baseline for many tasks

Conceptual Framework for scFM Limitations

[Figure 2 diagram: four groups of performance gap contributors map to mitigation strategies. Architectural mismatch (non-sequential gene interactions; suboptimal gene tokenization) → biology-inspired architectures, improved gene representation. Data quality issues (batch effect propagation in pretraining data; data leakage in evaluation) → enhanced data curation, rigorous zero-shot evaluation. Pretraining limitations (masked language modeling objective; poor generalization under distribution shift) → task-aware pretraining objectives, rigorous zero-shot evaluation. Task-specific weaknesses (batch integration underperformance; perturbation prediction limitations) → rigorous zero-shot evaluation.]

Figure 2: scFM Performance Gap Analysis Framework. This diagram illustrates the key factors contributing to scFM underperformance and potential strategies for addressing these limitations.

The performance gaps between scFMs and simpler baseline methods in zero-shot settings highlight the ongoing challenges in developing truly robust and generalizable foundation models for single-cell biology. Rather than justifying wholesale dismissal of scFMs, these findings should guide more targeted development efforts focused on specific limitations.

Future work should prioritize developing biologically meaningful evaluation metrics like scGraph-OntoRWR, which measures the consistency of cell type relationships captured by scFMs with prior biological knowledge [2]. Additionally, standardized benchmarking frameworks like BioLLM [17] and PertEval-scFM [39] will enable more rigorous and comparable evaluations across the field.

For researchers currently applying these tools, we recommend a cautious approach: always compare scFM performance against simpler baselines such as HVG selection, scVI, and Harmony, particularly for critical analyses where accuracy is essential. As the field evolves, addressing the fundamental architectural and training limitations identified in this application note will be essential for realizing the full potential of foundation models in single-cell genomics and translational research.

In single-cell RNA sequencing (scRNA-seq) research, technical artifacts introduced through variations in experiments, sequencing platforms, or sample preparation processes can generate batch effects that mask true biological signals [41] [42]. These technical confounders represent a significant hurdle for all analytical approaches, including emerging zero-shot learning foundation models that promise to accelerate biological discovery without task-specific training [4] [5]. The fundamental challenge lies in distinguishing biologically irrelevant technical noise from meaningful biological variation, particularly when analyzing data from multiple sources or experimental conditions.

The critical importance of this challenge is underscored by recent evaluations of single-cell foundation models such as scGPT and Geneformer, which have demonstrated limited zero-shot performance in batch integration tasks [4] [3]. In some cases, these sophisticated models are outperformed by traditional computational methods and even simple feature selection approaches like selecting highly variable genes [4] [3]. This reveals a crucial gap in our current analytical capabilities and highlights the necessity of robust preprocessing and quality control protocols to ensure data quality before applying foundation models.

Understanding Technical Noise and Batch Effects

Technical noise in scRNA-seq data arises from multiple sources throughout the experimental workflow. Ambient RNA contamination occurs when transcripts from damaged or apoptotic cells leak out during single-cell isolation and become encapsulated in droplets along with other cells [42]. Additional artifacts include barcode swapping (incorrect binding between barcodes during sequencing) and multiplets (where more than one cell is captured within a single droplet or microwell) [42]. The multiplet rate is influenced by the scRNA-seq platform and the number of loaded cells; for example, 10x Genomics reports a 5.4% multiplet rate when 7,000 target cells are loaded, escalating to 7.6% with 10,000 cells [42].
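Assuming the multiplet rate scales roughly linearly with the number of loaded cells, the two 10x Genomics figures quoted above (5.4% at 7,000 cells; 7.6% at 10,000) give a simple interpolation rule of thumb. This is not a vendor formula, just a linear fit through those two published data points.

```python
def expected_multiplet_rate(loaded_cells):
    """Linear interpolation of the two quoted 10x figures:
    5.4% at 7,000 loaded cells and 7.6% at 10,000."""
    slope = (7.6 - 5.4) / (10_000 - 7_000)   # percentage points per cell
    return 5.4 + slope * (loaded_cells - 7_000)

print(round(expected_multiplet_rate(8_500), 2))  # 6.5 (% multiplets)
```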

Batch effects represent another significant category of technical variation, stemming from differences in experimental conditions, tissue storage, dissociation processes, and sequencing library preparation [42]. These effects can cause clusters to appear as distinct cell types even when they are actually the same, potentially leading to erroneous biological interpretations if not properly addressed.

Impact on Foundation Model Performance

The presence of technical noise and batch effects poses particular challenges for single-cell foundation models. Recent zero-shot evaluations of Geneformer and scGPT revealed that these models often fail to correct for batch effects between different experimental techniques [4]. In some cases, Geneformer's embedding space failed to retain information about cell type, with clustering primarily driven by batch effects rather than biological reality [4]. While scGPT's embeddings offered some separation between cell types, the primary structure in dimensionality reduction was still dominated by technical variation [4].

Quantitative evaluation with batch integration metrics demonstrated that both Geneformer and scGPT underperformed relative to established methods like Harmony and scVI across most datasets [4]. Surprisingly, the best batch integration scores for all datasets were achieved by simply selecting highly variable genes, highlighting the continued importance of fundamental preprocessing steps [4].

Quantitative Evaluation of Batch Correction Methods

Table 1: Performance Comparison of Batch Correction Methods Across Multiple Metrics

Method Cell Type Clustering (AvgBIO Score) Batch Integration (Pancreas Dataset) Computational Efficiency Preservation of Rare Cell Types
Harmony Moderate to High Excellent for technical variation High Moderate
scVI High Excellent for technical variation Moderate Moderate
HVG Selection Variable Excellent across datasets Very High Limited
scGPT (zero-shot) Inconsistent Poor to Moderate Low Unknown
Geneformer (zero-shot) Poor Poor Low Unknown
BDACL High Not reported Not reported Excellent

Table 2: Performance of Foundation Models in Zero-Shot Cell Type Clustering

Model Performance Relative to Baselines Consistency Across Datasets Effect of Pretraining Data Batch Integration Capability
scGPT Underperforms scVI and Harmony on most datasets Variable; better on PBMC (12k) dataset Improves with pretraining, but larger datasets not always beneficial Fails to correct for batch effects between techniques
Geneformer Consistently underperforms baselines Poor across datasets Limited improvement even with pretraining data overlap Fails to retain cell type information; clustering driven by batch

Experimental Protocols for Quality Control

Comprehensive Quality Control Workflow

The following protocol outlines a standardized workflow for quality control in scRNA-seq data analysis, adapted from established best practices [42] [43] [44]:

Step 1: Initial Data Assessment

  • Import count matrices from preprocessing tools (CellRanger, STARsolo, etc.)
  • Distinguish between "Droplet" matrices (containing empty droplets), "Cell" matrices (empty droplets excluded), and "FilteredCell" matrices (poor-quality cells excluded) [44]
  • Generate preliminary quality metrics including total counts, genes detected per cell, and percentage of mitochondrial genes

Step 2: Empty Droplet Detection

  • Apply algorithms such as barcodeRanks and EmptyDrops from the dropletUtils package [44]
  • Identify the knee and inflection points in the log-log plot of barcode ranks against total counts
  • Flag barcodes with total counts below these thresholds as empty droplets
  • Remove empty droplets from subsequent analysis
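The knee-finding idea in Step 2 can be sketched with a simple geometric heuristic: on the log-log barcode-rank curve, take the point farthest from the straight line joining the curve's endpoints. This is a stand-in for the `barcodeRanks` knee estimate from dropletUtils, not its actual algorithm, and the count distributions below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)

def knee_point(total_counts):
    """Estimate the barcode-rank knee as the point farthest from the
    chord joining the curve's endpoints in log-log space."""
    counts = np.sort(np.asarray(total_counts, dtype=float))[::-1]
    counts = counts[counts > 0]
    x = np.log10(np.arange(1, counts.size + 1))
    y = np.log10(counts)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # perpendicular distance of each point to the endpoint chord
    d = np.abs(dx * (y - y[0]) - dy * (x - x[0])) / np.hypot(dx, dy)
    return counts[int(np.argmax(d))]  # count threshold at the knee

# Synthetic mixture: 1,000 real cells vs. 20,000 ambient-level droplets
cells = rng.lognormal(mean=9.0, sigma=0.3, size=1_000)
empties = rng.lognormal(mean=2.0, sigma=0.5, size=20_000)
threshold = knee_point(np.concatenate([cells, empties]))
print(threshold > empties.max())  # True: knee sits above the ambient plateau
```

Barcodes with total counts below the returned threshold would be flagged as candidate empty droplets; EmptyDrops then refines this with a statistical test against the ambient profile.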

Step 3: Transcript-Level Quality Control

  • Remove artifact transcripts including ambient RNA using tools like SoupX or CellBender [42]
  • Filter out overabundant genes that may induce batch effects: ribosomal genes, immunoglobulin genes, HLA genes, and specific long non-coding RNAs [42]
  • Approach stress-related gene removal cautiously, as these may reflect biological response rather than technical artifacts

Step 4: Cell-Level Quality Control

  • Detect and remove doublets/multiplets using tools like Scrublet, DoubletFinder, or doubletCells [42] [44]
  • Filter cells based on quality thresholds:
    • Remove cells with excessively high or low gene/UMI counts [42] [43]
    • Exclude cells with mitochondrial percentage exceeding 5-15% (tissue-dependent) [42] [43]
    • Apply median absolute deviation (MAD) filtering to identify outliers across multiple QC metrics [43]
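The MAD filtering in Step 4 can be sketched directly: a cell is flagged when a QC metric deviates from the cohort median by more than a fixed number of median absolute deviations. The metric values and the `nmads=5` cutoff below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def mad_outliers(metric, nmads=5.0):
    """Flag cells whose QC metric lies more than `nmads` median absolute
    deviations from the median, per the MAD filtering step above."""
    metric = np.asarray(metric, dtype=float)
    med = np.median(metric)
    mad = np.median(np.abs(metric - med))
    return np.abs(metric - med) > nmads * mad

# Toy QC metric: log10 total counts per cell, plus two extreme cells
log_counts = np.concatenate([rng.normal(3.5, 0.2, size=500), [0.1, 9.0]])
mask = mad_outliers(log_counts, nmads=5)
print(mask[-2:].all())  # True: both extreme cells are flagged
```

In practice the same filter is applied to several metrics (counts, detected genes, mitochondrial fraction) and a cell is removed if it is an outlier on any of them.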

Step 5: Data Normalization and Scaling

  • Regress out technical covariates including total UMIs per cell, mitochondrial gene percentage, and stress signatures [42]
  • Account for cell cycle heterogeneity by regressing out cell cycle scores [42]
  • Apply appropriate normalization methods to address differences in sequencing depth
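The depth-normalization part of Step 5 can be sketched as follows: scale each cell's counts to a common target depth, then apply a log1p transform, analogous to scanpy's `normalize_total` followed by `log1p`. Covariate regression (UMIs, mitochondrial fraction, cell cycle) is a separate modeling step not shown here.

```python
import numpy as np

rng = np.random.default_rng(6)

def normalize_log1p(counts, target_sum=1e4):
    """Depth-normalize each cell (row) to `target_sum` total counts,
    then log1p-transform, as in the normalization step above."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / depth * target_sum)

# Toy counts: 5 cells x 100 genes with cell-specific depth differences
raw = rng.poisson(2.0, size=(5, 100)) * rng.integers(1, 4, size=(5, 1))
norm = normalize_log1p(raw)
# After normalization, every cell's back-transformed counts sum to target
print(np.allclose(np.expm1(norm).sum(axis=1), 1e4))  # True
```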

Step 6: Batch Effect Correction

  • Select appropriate integration methods based on data complexity:
    • Use Harmony for simple integration tasks with distinct batch and biological structures [42]
    • Apply scVI for complex integration tasks such as tissue or organ atlases [42]
    • Consider BBKNN when runtime and memory efficiency are the primary constraints [42]
  • Exercise caution when correcting heterogeneous samples (e.g., tumors) to avoid removing biologically meaningful variation

Quality Control Visualization Workflow

[Diagram 1 flow: Raw Count Matrix → Empty Droplet Detection (Droplet matrix) → Transcript QC (Cell matrix; ambient RNA removed) → Cell-level QC (FilteredCell matrix) → Normalization & Scaling → Batch Effect Correction (normalized counts → integrated data) → Foundation Model Application (cell embeddings) → Downstream Analysis.]

Diagram 1: Comprehensive Quality Control Workflow for Single-Cell RNA Sequencing Data. This workflow outlines the sequential steps for processing scRNA-seq data before application of foundation models, highlighting critical stages for addressing technical noise and batch effects.

The Scientist's Toolkit: Essential Research Reagents and Computational Tools

Table 3: Key Research Reagent Solutions for scRNA-seq Quality Control

Tool/Reagent Function Application Context
SoupX Ambient RNA removal Effective for single-nucleus data; requires some manual input of marker genes
CellBender Background noise reduction Superior for cleaning noisy datasets and extracting biological signals
Scrublet Doublet detection Scalable for large datasets; identifies multiplets in droplet-based platforms
DoubletFinder Doublet detection High detection accuracy with strong statistical stability in downstream analyses
Harmony Batch effect correction Ideal for simple integration tasks with distinct batch and biological structures
scVI Batch effect correction Suitable for complex integration tasks like tissue or organ atlases
BBKNN Batch effect correction Scalable option when runtime and memory efficiency are constraints
DecontX Ambient RNA estimation Estimates contamination levels and deconvolutes native vs. contaminating RNA

Strategies for Optimizing Foundation Model Performance

Preprocessing Strategies for Enhanced Model Utility

Given the current limitations of single-cell foundation models in zero-shot settings, researchers should adopt specific preprocessing strategies to optimize performance:

Data Quality Assessment

  • Implement rigorous quality control metrics before applying foundation models
  • Utilize comprehensive pipelines like SCTK-QC that integrate multiple QC tools and generate standardized reports [44]
  • Carefully document quality thresholds and filtering parameters for reproducibility

Batch Effect Management

  • Apply appropriate batch correction methods based on data complexity and structure [42]
  • Avoid overcorrection that might remove biologically meaningful variation, particularly in heterogeneous samples like tumors [42]
  • Validate integration success using multiple metrics and visualization approaches

Feature Selection Considerations

  • Recognize that simple approaches like highly variable gene selection may outperform foundation model embeddings for some tasks [4] [3]
  • Experiment with different feature selection strategies when using foundation models in zero-shot settings
  • Document feature selection methods thoroughly to enable replication

Method Selection Framework

[Diagram 2 decision tree: starting from a scRNA-seq dataset, assess dataset size and complexity, then batch effect complexity — technical variation only → simple batch correction (Harmony); mixed technical and biological variation → complex integration (scVI, BBKNN). Then consider the primary biological question: exploratory analysis with unknown cell types → apply a foundation model with caution; well-defined cell types and known biology → traditional methods (Harmony, scVI, HVG).]

Diagram 2: Method Selection Framework for Batch Effect Correction. This decision tree guides researchers in selecting appropriate computational methods based on dataset characteristics and research objectives, highlighting scenarios where foundation models may be appropriate versus cases where traditional methods are preferable.

The effective handling of batch effects and technical noise remains a fundamental challenge in single-cell genomics, particularly with the emergence of foundation models that promise zero-shot biological discovery. Current evidence suggests that even sophisticated foundation models like scGPT and Geneformer struggle with batch effect correction in zero-shot settings and may be outperformed by traditional methods [4] [3] [5]. This reality underscores the continued importance of rigorous quality control protocols and appropriate method selection based on specific dataset characteristics and research questions.

As the field advances, researchers must maintain a critical perspective on methodological claims, particularly regarding the zero-shot capabilities of foundation models. The development of standardized evaluation practices—including comprehensive zero-shot assessment—will be crucial for accurately measuring progress in this rapidly evolving domain [4] [3]. By implementing robust quality control workflows, selecting appropriate batch correction methods, and understanding the current limitations of foundation models, researchers can more effectively navigate the data quality hurdle and advance our understanding of cellular biology.

Single-cell foundation models (scFMs) represent a paradigm shift in computational biology, leveraging large-scale deep learning to interpret complex single-cell omics data. These models are pretrained on vast datasets through self-supervised learning, enabling adaptation to various downstream tasks such as cell type annotation, batch integration, and perturbation prediction without task-specific labels [1] [45]. The performance of scFMs in zero-shot learning settings—where models are applied without further training—is critically dependent on the quality, scale, and diversity of their pretraining data [38] [16]. This protocol examines the quantitative relationships between dataset characteristics and model efficacy, providing actionable guidelines for constructing optimized pretraining corpora for scFMs.

The Critical Role of Pretraining Data in scFMs

The foundational premise of scFMs mirrors that of large language models: exposure to massive, diverse datasets enables the learning of fundamental biological principles that generalize across tasks. In single-cell biology, individual cells are treated analogously to sentences, with genes or genomic features serving as tokens or words [1] [45]. The transformer architectures underpinning most scFMs utilize attention mechanisms to learn relationships between genes across millions of cellular contexts, forming a universal representation of cellular states and functions [1] [26].

The self-supervised pretraining process typically employs objectives like masked gene modeling, where the model learns to predict randomly masked genes based on the context of other genes in the cell [1] [15]. This process allows the model to internalize complex gene regulatory relationships, cellular functions, and expression patterns without manual annotation. The resulting model embeddings—both at the gene and cell level—encode biological knowledge that can be leveraged for diverse analytical tasks through zero-shot application or minimal fine-tuning [16] [2].
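The "cell as sentence" framing above can be made concrete with a few lines of code. The sketch below uses a simplified rank-based encoding in the spirit of Geneformer; the function name is illustrative, and real pipelines additionally normalize by gene-wise medians, truncate to a fixed context length, and add special tokens.

```python
import numpy as np

def rank_tokenize(expression, gene_ids):
    """Turn one cell's expression vector into a 'cell sentence':
    gene IDs ordered from highest to lowest expression, with
    unexpressed genes dropped. A simplified rank-based tokenization;
    not any published model's exact implementation."""
    expression = np.asarray(expression, dtype=float)
    expressed = expression > 0
    order = np.argsort(-expression[expressed], kind="stable")
    return np.asarray(gene_ids)[expressed][order]

# One toy cell: gene g3 is the most expressed, g0 is silent.
sentence = rank_tokenize([0, 5, 2, 9], ["g0", "g1", "g2", "g3"])
```

The resulting token sequence is what a transformer consumes; masked gene modeling then hides a subset of these tokens and asks the model to reconstruct them from the remaining context.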

Quantitative Impact of Dataset Characteristics

Dataset Scale and Model Performance

Extensive benchmarking reveals a complex relationship between pretraining dataset size and downstream task performance. The following table summarizes empirical findings from leading scFM implementations:

Table 1: Impact of Pretraining Dataset Scale on Model Performance

| Model | Pretraining Dataset Size | Key Performance Findings | Primary Limitations |
| --- | --- | --- | --- |
| CellFM [15] | 100 million human cells | Outperforms existing models in cell annotation, perturbation prediction, and gene function prediction; demonstrates benefits of extreme scale for single-species modeling. | Computational intensity; requires specialized infrastructure (e.g., Ascend910 NPUs). |
| scGPT [1] [15] | 33 million human cells | Strong performance in multi-omic integration and zero-shot annotation; robust across diverse tasks. | Inconsistent zero-shot performance on some datasets compared to simpler methods [38]. |
| Geneformer [1] [16] | 30 million cells | Effective for gene-level tasks and transfer learning; captures biologically meaningful relationships. | Underperforms in zero-shot batch integration and cell type clustering [38]. |
| scFoundation [16] [15] | ~50 million cells | Directly predicts raw gene expression values; preserves full data resolution. | Performance varies across tasks; no consistent superiority across all benchmarks. |
| UCE [16] | 36 million cells | Integrates cross-species data using protein language models; captures molecular diversity. | Large parameter count (650M) increases computational demands. |
| LangCell [16] | 27.5 million scRNA-text pairs | Incorporates cell type labels during pretraining; enables novel text-cell integration capabilities. | Performance depends on quality and consistency of text annotations. |

The relationship between scale and performance exhibits diminishing returns. Evaluations of scGPT variants pretrained on datasets of different sizes (from 814,000 kidney cells to 33 million diverse human cells) demonstrated that while pretraining provides clear benefits over random initialization, larger and more diverse datasets do not always confer proportional improvements [38]. In some cases, smaller tissue-specific models (e.g., scGPT blood trained on 10.3 million blood and bone marrow cells) performed comparably to or even better than the larger general model on specific tissue types [38].

Dataset Diversity and Composition

Beyond sheer volume, the diversity of cell types, tissues, and experimental conditions within pretraining data significantly impacts model robustness and generalizability:

Table 2: Impact of Dataset Diversity on Model Generalization

| Diversity Dimension | Impact on Model Performance | Evidence from Benchmarking |
| --- | --- | --- |
| Cell Type Diversity | Enables recognition of rare cell types and improves cross-tissue generalization. | Models trained on diverse atlases (e.g., Human Cell Atlas) outperform tissue-specific models on novel cell types [1] [16]. |
| Species Representation | Facilitates cross-species learning and evolutionary insights. | UCE demonstrates effectiveness in capturing molecular diversity across species [16] [15]. |
| Experimental Conditions | Improves robustness to technical variations and batch effects. | Models trained on data from multiple technologies (10x Genomics, Smart-seq2, etc.) show better integration capabilities [16] [2]. |
| Disease States | Enhances clinical relevance and disease-specific insights. | Inclusion of diseased cells (e.g., 7.1M viral infection cells, 3.5M lung cancer cells) improves pathological characterization [15]. |

The composition balance of pretraining datasets emerges as a critical factor. Models trained on data from specific tissues (e.g., blood and bone marrow) may outperform more general models on tasks involving those same tissues, even when the general model was trained on significantly more data [38]. This suggests that strategic balancing of tissue representation, rather than simply maximizing total cell count, may optimize pretraining efficiency.

Dataset Curation Protocols and Quality Control

Standardized Curation Workflow

Implementing rigorous data curation protocols is essential for constructing high-quality pretraining datasets. The following workflow, implemented successfully for CellFM, provides a template for systematic dataset assembly:

Data Acquisition from Multiple Repositories → Raw Data Processing & Expression Matrix Generation → Quality Control: Cell & Gene Filtering → Gene Name Standardization (HGNC Guidelines) → Format Conversion to Unified Sparse Matrix → Metadata Annotation & Dataset Balancing → Curated Dataset Ready for Model Pretraining

Diagram 1: Dataset Curation and Quality Control Workflow

Critical Quality Control Measures

  • Multi-Source Data Acquisition: Collect data from diverse repositories including NCBI GEO, ENA, GSA, ImmPort, and CELLxGENE [1] [15]. CELLxGENE alone provides unified access to over 100 million standardized single-cell profiles, making it an invaluable resource [1] [26].

  • Quality Control and Filtering:

    • Filter cells based on quality metrics: mitochondrial read percentage, unique gene counts, and total read counts [15].
    • Remove lowly expressed genes that appear in only a small fraction of cells [1].
    • Implement sample-level filtering to exclude datasets with evident technical artifacts or poor sequencing quality [1].
  • Gene Name Standardization: Apply HUGO Gene Nomenclature Committee (HGNC) guidelines consistently across all datasets to ensure uniform gene identifiers [15]. This critical step resolves discrepancies in gene symbol usage across different source datasets.

  • Metadata Annotation and Balancing:

    • Annotate cells with standardized metadata including tissue origin, disease status, donor characteristics, and experimental platform [16] [15].
    • Balance dataset composition to prevent overrepresentation of common cell types (e.g., T cells) and underrepresentation of rare populations [1] [16].
  • Batch Effect Documentation: Document technical batch effects (platform, laboratory, processing protocol) but avoid aggressive batch correction during pretraining dataset construction to preserve biological variance [1] [16].
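The cell-level filtering step above can be sketched directly on a cells x genes count matrix. The thresholds and the function name below are illustrative assumptions, not fixed standards: appropriate cutoffs depend on tissue, species, and sequencing platform.

```python
import numpy as np

def qc_filter_cells(counts, gene_names, max_mito_frac=0.2,
                    min_genes=200, min_counts=500):
    """Flag cells passing basic QC on a cells x genes count matrix.
    Criteria: mitochondrial read fraction, number of detected genes,
    and total read counts. Thresholds are illustrative only."""
    counts = np.asarray(counts, dtype=float)
    # Mitochondrial genes identified by the conventional "MT-" prefix.
    is_mito = np.char.startswith(np.asarray(gene_names, dtype=str), "MT-")
    total = counts.sum(axis=1)
    mito_frac = np.divide(counts[:, is_mito].sum(axis=1), total,
                          out=np.zeros_like(total), where=total > 0)
    n_genes = (counts > 0).sum(axis=1)
    return (mito_frac <= max_mito_frac) & (n_genes >= min_genes) & (total >= min_counts)
```

In practice the same metrics are computed by established toolkits (e.g., scanpy's QC utilities); the sketch simply makes the filtering logic explicit.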

Experimental Protocols for Evaluating Data Impact

Zero-Shot Performance Benchmarking

To quantitatively assess how dataset characteristics influence model capabilities, implement the following evaluation protocol:

  • Embedding Extraction: Generate zero-shot cell embeddings from the pretrained model without any fine-tuning [38] [16].

  • Cell Type Clustering Evaluation:

    • Apply standard clustering algorithms (e.g., Louvain, Leiden) to model embeddings.
    • Calculate Average BIO (AvgBio) score and Average Silhouette Width (ASW) to quantify cluster separation and cohesion [38].
    • Compare against baseline methods including Highly Variable Genes (HVG) selection, Harmony, and scVI [38] [16].
  • Batch Integration Assessment:

    • Apply models to datasets with known batch effects (e.g., Pancreas benchmark with five different sources) [38].
    • Quantify batch mixing using metrics such as principal component regression (PCR) score while preserving biological variation [38].
    • Visualize embeddings to confirm integration of technical replicates while maintaining separation of distinct cell types [38].
  • Biological Relevance Validation:

    • Implement ontology-informed metrics including scGraph-OntoRWR to measure consistency of captured cell type relationships with established biological knowledge [16] [2].
    • Apply Lowest Common Ancestor Distance (LCAD) to assess the severity of cell type misclassifications based on ontological proximity [16] [2].
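The idea behind LCAD is that confusing two sibling cell types is a milder error than confusing distant lineages. A minimal pure-Python sketch on an ontology given as child-to-parent links follows; it is a simplified stand-in for the published metric, and the toy ontology is invented for illustration.

```python
def lcad(a, b, parent):
    """Lowest Common Ancestor Distance: edges from each node up to
    their lowest common ancestor, summed. Tree given as child -> parent."""
    def ancestors(n):
        path = [n]
        while n in parent:
            n = parent[n]
            path.append(n)
        return path
    pa, pb = ancestors(a), ancestors(b)
    sb = set(pb)
    for da, node in enumerate(pa):
        if node in sb:
            return da + pb.index(node)
    raise ValueError("nodes share no common ancestor")

# Toy ontology: CD4/CD8 T cells are siblings; B cells are cousins.
parent = {"CD4 T": "T cell", "CD8 T": "T cell",
          "T cell": "lymphocyte", "B cell": "lymphocyte"}
```

Under this tree, mislabeling a CD4 T cell as CD8 T scores distance 2, while calling it a B cell scores 3, so averaging LCAD over a test set grades annotation errors by ontological severity rather than treating all mistakes equally.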

Cross-Architecture Comparison Protocol

To isolate data impacts from architectural effects, implement cross-model benchmarking:

  • Model Selection: Include diverse architectures (Geneformer, scGPT, UCE, scFoundation, LangCell, scCello) representing different pretraining strategies [16].

  • Task Diversity: Evaluate across gene-level (gene function prediction, gene-gene relationships) and cell-level (batch integration, cell type annotation, drug sensitivity prediction) tasks [16] [2].

  • Performance Aggregation: Use non-dominated sorting algorithms to aggregate multiple evaluation metrics into holistic model rankings [16].
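Non-dominated sorting avoids collapsing incomparable metrics into one weighted score. A minimal sketch of extracting the first (non-dominated) front, assuming higher is better on every metric; the O(n²) scan is fine at benchmark scale:

```python
def pareto_front(scores):
    """Indices of models not dominated on any metric (higher is better).
    Model i is dominated if some j is >= on every metric and > on at
    least one; peeling fronts repeatedly yields a holistic ranking."""
    front = []
    for i, si in enumerate(scores):
        dominated = any(
            all(b >= a for a, b in zip(si, sj)) and
            any(b > a for a, b in zip(si, sj))
            for j, sj in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Three hypothetical models scored on (clustering, integration):
models = [(0.80, 0.60), (0.70, 0.75), (0.65, 0.55)]
```

Here the first two models trade off against each other and both survive, while the third is strictly worse than the first and is ranked behind them.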

Table 3: Essential Research Reagents and Computational Resources for scFM Pretraining

| Resource Category | Specific Tools & Platforms | Primary Function | Access Considerations |
| --- | --- | --- | --- |
| Data Repositories | CZ CELLxGENE Discover [1], DISCO [26], NCBI GEO [1], Human Cell Atlas [1] | Provide standardized, annotated single-cell datasets for pretraining | CELLxGENE offers >100 million cells; DISCO supports federated analysis |
| Computational Frameworks | BioLLM [17], MindSpore (CellFM) [15], PyTorch (scGPT) [1] | Unified interfaces for model training and evaluation; specialized AI frameworks | BioLLM standardizes APIs across models; MindSpore optimized for Ascend chips |
| Pretraining Corpora | Curated compendia from PanglaoDB [1], Human Ensemble Cell Atlas [1] | Provide pre-integrated datasets from multiple sources | Reduce curation overhead but require validation for specific use cases |
| Hardware Infrastructure | Ascend910 NPUs [15], GPU clusters | Accelerate training of large models (100M-800M parameters) | CellFM required 4x Atlas800 servers with 8x Ascend910 NPUs each |
| Evaluation Platforms | scGNN+ [26], specialized benchmarking frameworks [16] [2] | Automate optimization and provide biologically informed evaluation | Incorporate novel metrics like scGraph-OntoRWR for biological relevance |

Optimizing pretraining datasets for single-cell foundation models requires balanced consideration of scale, diversity, and curation quality. While increasing dataset size generally improves performance, evidence suggests diminishing returns beyond certain thresholds, emphasizing the importance of strategic dataset composition and rigorous quality control [38] [16]. Future work should focus on developing standardized curation protocols, optimizing dataset balancing algorithms, and establishing rigorous benchmarks for evaluating the biological fidelity of learned representations, particularly in zero-shot settings where scFMs face their most significant challenges and opportunities [38] [16].

Single-cell foundation models (scFMs), pretrained on vast datasets using self-supervised objectives like Masked Language Modeling (MLM), promise to transform biological discovery. A critical evaluation of their zero-shot capabilities, however, reveals significant limitations. This Application Note demonstrates that in zero-shot settings—essential for exploratory biology where labels are unknown—current scFMs can be outperformed by simpler, established methods in tasks such as cell type clustering and batch integration. We present structured quantitative evaluations and detailed experimental protocols to guide researchers in benchmarking model performance, emphasizing that the choice of pretraining objective is paramount for developing robust, reliable, and biologically insightful scFMs.

The advent of single-cell foundation models (scFMs) represents a paradigm shift, aiming to leverage large-scale, unlabeled data to build foundational knowledge of cellular biology. These models, often based on transformer architectures, are typically pretrained using self-supervised objectives, with Masked Language Modeling (MLM) being a predominant choice [1]. In this framework, portions of a cell's gene expression profile are masked, and the model is trained to reconstruct them, analogous to how language models predict missing words [1].

A model's true generalizability, however, is most rigorously tested in a zero-shot setting, where its pretrained internal representations (embeddings) are used for downstream tasks without any task-specific fine-tuning [4]. This is not merely a technical benchmark; it is a fundamental requirement for discovery-driven science. In many research contexts, such as identifying novel cell states or characterizing heterogeneous tumor microenvironments, predefined labels do not exist, precluding the possibility of fine-tuning [4]. The performance of a model in this setting is a direct reflection of the quality and transferability of the biological knowledge acquired during pretraining. Recent evidence suggests that the current generation of scFMs, including Geneformer and scGPT, may face reliability challenges in this critical regime, sometimes being outperformed by simpler methods like highly variable gene (HVG) selection or established integration tools like Harmony and scVI [4]. This underscores the urgent need for systematic evaluation of how different pretraining objectives contribute to robust zero-shot performance.

Quantitative Evaluation of Model Performance

A rigorous, quantitative benchmark is essential for comparing the effectiveness of different models and pretraining strategies. The following tables summarize key performance metrics across critical single-cell analysis tasks.

Table 1: Zero-Shot Cell Type Clustering Performance (AvgBIO Score) This table evaluates the ability of model-generated cell embeddings to separate known cell types without further training. A higher AvgBIO score indicates better performance [4].

| Model / Method | PBMC (12k) | Tabula Sapiens | Pancreas | Immune Dataset |
| --- | --- | --- | --- | --- |
| HVG (Baseline) | 0.75 | 0.68 | 0.71 | 0.73 |
| scVI | 0.72 | 0.70 | 0.75 | 0.70 |
| Harmony | 0.70 | 0.65 | 0.72 | 0.69 |
| scGPT | 0.78 | 0.62 | 0.68 | 0.65 |
| Geneformer | 0.65 | 0.58 | 0.60 | 0.61 |

Table 2: Batch Integration Performance (Batch Mixing Score) This table assesses the model's capacity to integrate data from multiple sources, removing technical batch effects while preserving biological variation. A higher score indicates better batch correction [4].

| Model / Method | Pancreas | PBMC | Tabula Sapiens | Immune Dataset |
| --- | --- | --- | --- | --- |
| HVG (Baseline) | 0.89 | 0.91 | 0.85 | 0.88 |
| scVI | 0.85 | 0.88 | 0.80 | 0.75 |
| Harmony | 0.82 | 0.85 | 0.72 | 0.83 |
| scGPT | 0.78 | 0.80 | 0.81 | 0.82 |
| Geneformer | 0.65 | 0.68 | 0.62 | 0.64 |

Table 3: Comparing Pretraining Objectives in NLP Insights from natural language processing on how objectives affect representation learning. MLM excels in representation tasks, while Causal Language Modeling (CLM) shows data efficiency. A combined strategy can be optimal [46].

| Pretraining Objective | Model Architecture | Key Strengths | Key Weaknesses |
| --- | --- | --- | --- |
| Masked Language Modeling (MLM) | Encoder (e.g., BERT) | Robust performance across various representation tasks; bidirectional context. | Less data-efficient than CLM; can be less stable during fine-tuning. |
| Causal Language Modeling (CLM) | Decoder (e.g., GPT) | High data efficiency; improved fine-tuning stability. | Underperforms MLM on some text representation tasks. |
| Sequential (CLM then MLM) | Encoder-Decoder | Combines data efficiency of CLM with robust performance of MLM; optimal under fixed compute. | Requires a two-stage training process. |

Experimental Protocols for Benchmarking scFMs

To ensure reproducible and comparable evaluations of scFMs, researchers should adhere to the following detailed experimental protocols.

Protocol for Zero-Shot Cell Type Clustering

Objective: To evaluate the quality of a foundation model's cell embeddings in separating known cell types without any fine-tuning.

Materials:

  • A held-out test scRNA-seq dataset with ground-truth cell type labels (e.g., from Tabula Sapiens).
  • A pretrained foundation model (e.g., scGPT, Geneformer).
  • Baseline methods for comparison (e.g., HVGs, scVI, Harmony).

Procedure:

  • Data Preprocessing: Apply standard preprocessing to the test dataset, including quality control, normalization, and log-transformation of gene expression counts. Do not train or fine-tune the foundation model on this data.
  • Embedding Generation:
    • For the foundation model, input the preprocessed expression matrix and extract the cell embeddings from the model's output layer.
    • For HVGs, select the top 2,000-5,000 highly variable genes and use this reduced matrix as the embedding.
    • Generate cell embeddings using scVI and Harmony according to their standard documentation.
  • Dimensionality Reduction & Clustering: Apply principal component analysis (PCA) to all embedding matrices, followed by Leiden or Louvain clustering on a shared k-nearest neighbor (k-NN) graph built from the first 50 principal components.
  • Metric Calculation: Compute clustering metrics such as the Average BIO (AvgBIO) score and Average Silhouette Width (ASW) by comparing the clusters to the ground-truth cell type labels. The BIO score balances the completeness and homogeneity of the clustering.

Interpretation: A model whose embeddings produce higher AvgBIO and ASW scores is better at capturing biologically meaningful variation related to cell identity in a zero-shot manner.
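The ASW part of the metric step can be written directly from its definition. A minimal numpy sketch follows; the rescaling of silhouette values from [-1, 1] to [0, 1] follows common single-cell benchmarking practice, and the O(n²) distance matrix means large datasets should be subsampled (or an optimized implementation such as scikit-learn's used instead).

```python
import numpy as np

def average_silhouette_width(X, labels):
    """Average Silhouette Width of embeddings X (n_cells x n_dims)
    against cell-type labels, rescaled to [0, 1]. Pure numpy;
    quadratic in the number of cells."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))          # pairwise Euclidean distances
    uniq = np.unique(labels)
    s = np.empty(len(X))
    for i in range(len(X)):
        own = labels[i]
        same = labels == own
        same[i] = False
        a = D[i, same].mean() if same.any() else 0.0          # intra-cluster
        b = min(D[i, labels == c].mean()                       # nearest other cluster
                for c in uniq if c != own)
        s[i] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return float((s.mean() + 1) / 2)
```

Well-separated cell types score near 1; labels uncorrelated with embedding structure score near 0.5.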

Protocol for Zero-Shot Batch Integration

Objective: To assess a model's ability to generate embeddings that mix cells from different batches (e.g., experiments, technologies) while preserving biological cell type separations.

Materials:

  • A benchmark dataset with known batch effects and cell type labels (e.g., the Pancreas dataset from [4]).
  • The pretrained foundation model and baseline methods.

Procedure:

  • Data Preprocessing: Prepare the dataset as in Protocol 3.1, ensuring batch information is retained.
  • Embedding Generation: Generate cell embeddings for the entire dataset using the foundation model and baseline methods in a zero-shot fashion.
  • Qualitative Visualization: Project the embeddings into two dimensions using UMAP. Create two UMAP plots for each method: one colored by cell type and another colored by batch.
  • Quantitative Evaluation: Calculate two complementary metrics:
    • Batch Mixing Score: Measures the degree of intermingling between batches within cell type clusters. A higher score indicates better batch correction.
    • Principal Component Regression (PCR) Score: Quantifies the proportion of variance in the embeddings explained by batch after regressing out biological covariates. A lower PCR score indicates that less technical variation remains.

Interpretation: Successful batch integration is indicated by a high batch mixing score, a low PCR score, and UMAP plots where cells cluster primarily by cell type rather than by batch.
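The PCR score in step 4 can be sketched as a variance-weighted R² of top principal components regressed on one-hot batch labels. This is a simplified form of the metric (real benchmarks regress out biological covariates first), so treat it as illustrative:

```python
import numpy as np

def pcr_score(embeddings, batch_labels, n_pcs=20):
    """Fraction of embedding variance (in top PCs) explained by batch.
    Lower is better: less residual technical variation. Simplified PCR."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(X, full_matrices=False)   # PCA via SVD
    k = min(n_pcs, S.size)
    pcs, var = U[:, :k] * S[:k], S[:k] ** 2
    # One-hot batch design with intercept (rank deficiency is fine for lstsq).
    cats = {b: i for i, b in enumerate(dict.fromkeys(batch_labels))}
    D = np.zeros((X.shape[0], len(cats) + 1))
    D[:, 0] = 1.0
    for row, b in enumerate(batch_labels):
        D[row, cats[b] + 1] = 1.0
    r2 = np.empty(k)
    for j in range(k):
        y = pcs[:, j]
        beta, *_ = np.linalg.lstsq(D, y, rcond=None)
        resid = y - D @ beta
        tot = (y ** 2).sum()                           # PCs are centered
        r2[j] = 0.0 if tot == 0 else 1 - resid @ resid / tot
    return float((r2 * var).sum() / var.sum())
```

An embedding with a strong batch shift along one axis scores high; the same embedding with the shift removed scores near zero.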

Protocol for Pretraining Objective Ablation

Objective: To isolate and evaluate the impact of different self-supervised pretraining objectives on downstream zero-shot performance.

Materials:

  • A large, diverse scRNA-seq corpus for pretraining (e.g., from CELLxGENE).
  • A suite of held-out benchmark tasks for evaluation (cell clustering, batch integration, perturbation prediction).

Procedure:

  • Model Architecture: Fix a single transformer architecture (e.g., a standard 6-layer encoder).
  • Objective Variation: Pretrain multiple instances of this model from scratch on the same pretraining corpus, but vary the pretraining objective:
    • MLM: A bidirectional objective where a random subset (e.g., 15-30%) of gene tokens are masked and the model must reconstruct them [1] [46].
    • CLM: A unidirectional (autoregressive) objective where the model predicts the next gene token in a sequence, given all previous tokens [46].
    • CLM+MLM: A biphasic strategy where the model is first pretrained with CLM for a portion of the steps, then the training is continued with the MLM objective [46].
  • Controlled Pretraining: Ensure all models are trained for an equal number of steps, with the same computational budget and hyperparameter tuning effort.
  • Zero-Shot Evaluation: Evaluate all pretrained models on the benchmark tasks using Protocols 3.1 and 3.2, ensuring no fine-tuning is performed.

Interpretation: This controlled ablation study directly reveals which pretraining objective leads to the most transferable and robust biological representations, separating the effect of the objective from other architectural and data-scale factors.
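The objective variation in step 2 reduces to how (input, target) pairs are built from a tokenized cell. A minimal numpy sketch, with illustrative function names (real pipelines batch, pad, and handle special tokens):

```python
import numpy as np

def clm_pairs(tokens):
    """Causal LM: predict each token from all preceding tokens."""
    tokens = np.asarray(tokens)
    return tokens[:-1], tokens[1:]       # (inputs, targets), shifted by one

def mlm_pairs(tokens, mask_rate=0.15, mask_id=0, seed=0):
    """Masked LM: corrupt random positions; loss is computed only there."""
    rng = np.random.default_rng(seed)
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < mask_rate
    corrupted = np.where(mask, mask_id, tokens)
    return corrupted, tokens, mask

# CLM sees only a prefix and predicts the next gene token; MLM sees the
# whole corrupted cell and reconstructs masked genes bidirectionally.
# The biphasic strategy simply switches from the first pair-builder to
# the second partway through the step budget.
```

Holding architecture, data, and compute fixed while swapping only these pair-builders is exactly the controlled comparison the ablation requires.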

The Scientist's Toolkit: Essential Research Reagents

The following table details key computational tools and resources essential for conducting research in single-cell foundation models.

Table 4: Key Research Reagent Solutions for scFM Development

| Reagent / Resource | Type | Function & Application |
| --- | --- | --- |
| CELLxGENE | Data Platform | Provides unified access to millions of standardized, annotated single-cell datasets, serving as a primary data source for pretraining scFMs [1]. |
| scGPT / Geneformer | Foundation Model | Pretrained transformer-based models for single-cell biology; used as benchmark models or for transfer learning on downstream tasks [4]. |
| scVI | Software Tool | A probabilistic framework for scRNA-seq data analysis; used as a strong baseline for dimensionality reduction, clustering, and batch correction [4]. |
| Harmony | Software Tool | An integration algorithm that projects cells into a shared embedding space, effectively removing batch effects; used as a baseline for integration benchmarks [4]. |
| ONNX Format | Model Format | An open format for representing machine learning models; used to export and visualize PyTorch models with tools like Netron for architectural inspection [47]. |

Visualizing Experimental Workflows

The following diagrams illustrate the logical relationships and experimental workflows described in this note.

scFM Zero-Shot Benchmarking

Input: Raw scRNA-seq Test Dataset → Data Preprocessing (QC, Normalization) → Generate Cell Embeddings (Zero-Shot) → Evaluation, branching into Cell Type Clustering (metrics: AvgBIO, ASW) and Batch Integration (metrics: Batch Mixing, PCR)

Pretraining Strategy Comparison

Large scRNA-seq Corpus → three pretraining arms: MLM Pretraining (Bidirectional), CLM Pretraining (Autoregressive), and Sequential CLM then MLM (Biphasic) → Zero-Shot Evaluation on Benchmark Tasks

The journey toward truly foundational models in single-cell biology requires moving beyond the assumption that scaling masked modeling is sufficient. As the quantitative evidence and protocols outlined here demonstrate, rigorous zero-shot evaluation is a critical litmus test. The performance gaps revealed in tasks like clustering and batch integration highlight that the current pretraining objectives may not be fully capturing the universal patterns of biology. Future development must prioritize the design of novel, biologically-grounded pretraining tasks and be validated through the systematic, zero-shot benchmarking methodologies described in this note. Only then can scFMs reliably fulfill their promise as indispensable tools for exploratory discovery in biomedicine and drug development.

Single-cell RNA sequencing (scRNA-seq) has revolutionized biological research by allowing scientists to probe transcriptomic profiles at the cellular level, revealing complex and rare cell populations that are obscured in bulk sequencing approaches [48] [49]. The analysis of this high-dimensional, sparse, and noisy data presents significant computational challenges [16]. In response, two distinct computational paradigms have emerged: traditional analysis methods and single-cell foundation models (scFMs). Traditional methods, such as those based on highly variable genes (HVG) selection, Harmony, and scVI, are well-established, computationally efficient tools designed for specific analytical tasks [4] [50]. In contrast, scFMs are large-scale deep learning models pretrained on millions of cells using self-supervised objectives, with the goal of learning universal biological principles that can be adapted to various downstream applications [16] [1].

The choice between these approaches is not straightforward, as no single scFM consistently outperforms others across all tasks, and simpler models often remain competitive, particularly in zero-shot settings where models are used without further training [16] [4]. This guide provides a structured framework for researchers to navigate this complex model selection landscape, emphasizing practical considerations related to task requirements, computational resources, and biological interpretability.

Understanding the Technologies

Traditional Single-Cell Analysis Methods

Traditional computational approaches for scRNA-seq analysis typically consist of specialized tools organized into analytical pipelines. These include methods for quality control, normalization, feature selection (e.g., Highly Variable Genes), dimensionality reduction (PCA, UMAP), clustering, and differential expression [48] [49]. Established integration algorithms like Harmony and scVI effectively correct for batch effects while preserving biological variation [4]. These methods are characterized by their focused functionality, relatively low computational demands, and well-understood statistical properties [50] [51]. They excel in well-defined analytical scenarios and remain the go-to choice for standard analyses with limited computational resources.

Single-Cell Foundation Models (scFMs)

Single-cell foundation models represent a paradigm shift from task-specific tools to general-purpose models. Inspired by large language models in natural language processing, scFMs treat individual cells as "sentences" and genes or genomic features as "words" or "tokens" [1]. These models, including Geneformer, scGPT, UCE, and scFoundation, are typically built on transformer architectures and pretrained on massive, diverse collections of single-cell data from sources like the CELLxGENE atlas, which contains over 100 million unique cells [16] [1]. Through self-supervised pretraining tasks such as masked gene modeling, scFMs learn latent representations of genes and cells that capture fundamental biological relationships [16]. These representations can then be utilized in zero-shot settings or efficiently fine-tuned for specific downstream applications, potentially uncovering insights that might be missed by traditional approaches [16].

Comparative Performance Analysis

Task-Specific Performance Evaluation

Comprehensive benchmarking studies reveal that the performance of scFMs versus traditional methods varies significantly across different analytical tasks. The table below summarizes their relative performance in key applications:

Table 1: Performance comparison across common single-cell analysis tasks

| Analysis Task | Superior Approach | Key Findings | Performance Metrics |
| --- | --- | --- | --- |
| Cell Type Clustering | Traditional Methods (HVG, scVI, Harmony) | scFMs (Geneformer, scGPT) underperform in zero-shot settings; pretraining provides limited benefit [4] | AvgBIO score, Average Silhouette Width (ASW) [4] |
| Batch Integration | Traditional Methods (HVG, scVI, Harmony) | Geneformer consistently ranks last; scGPT shows mixed results, outperforming baselines only on specific datasets [4] | Principal Component Regression (PCR), batch mixing scores [4] |
| Cell Type Annotation | Context-Dependent | scFMs show promise but require careful evaluation; errors can be measured by ontological proximity (LCAD metric) [16] | Lowest Common Ancestor Distance (LCAD) [16] |
| Drug Sensitivity Prediction | scFMs | Foundation models demonstrate stronger performance in clinically relevant prediction tasks [16] | Task-specific accuracy metrics [16] |
| Knowledge Capture | scFMs | scFMs better capture biological relationships aligned with prior knowledge (e.g., cell ontology) [16] | scGraph-OntoRWR metric [16] |

Zero-Shot Capabilities of scFMs

A critical consideration for researchers is the zero-shot performance of scFMs, where models are applied without any task-specific fine-tuning. This is particularly important in discovery settings where labels are unknown and fine-tuning is not feasible [4]. Current evaluations indicate that scFMs often face reliability challenges in zero-shot configurations and can be outperformed by simpler methods [4] [6]. For instance, in both cell type clustering and batch integration tasks, selecting highly variable genes (HVG) frequently outperforms both Geneformer and scGPT in zero-shot settings [4]. This suggests that the masked language model pretraining framework may not inherently produce high-quality cell embeddings without additional fine-tuning, highlighting a significant limitation for exploratory research [4].
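The HVG baseline that these scFMs fail to beat is itself only a few lines of code. The sketch below is a bare-bones dispersion-based selection on a log-normalized cells x genes matrix; it is a stand-in for production implementations (e.g., scanpy's `highly_variable_genes`), which use binned, normalized dispersions for robustness.

```python
import numpy as np

def select_hvg(log_counts, n_top=2000):
    """Pick highly variable genes by dispersion (variance / mean) on a
    cells x genes log-normalized matrix. Minimal illustrative version."""
    X = np.asarray(log_counts, dtype=float)
    mean = X.mean(axis=0)
    var = X.var(axis=0)
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    n_top = min(n_top, X.shape[1])
    return np.argsort(-dispersion)[:n_top]   # indices of top-dispersion genes
```

The selected gene subset, used directly as the embedding, is the simple baseline that zero-shot scFM embeddings are benchmarked against.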

Decision Framework for Model Selection

Key Selection Criteria

Choosing between scFMs and traditional methods requires careful consideration of multiple factors. The following diagram illustrates the decision workflow:

Start Model Selection → Define Analysis Task → Assess Data & Resources → Identify Primary Goal → either Choose Traditional Methods (standard analysis, limited resources, known cell types) or Choose scFM Approach (novel discovery, complex predictions, adequate resources) → Evaluate & Iterate

Task-Based Recommendations

Different analytical tasks warrant distinct approaches based on empirical performance evidence:

Table 2: Task-specific model recommendations

| Task Category | Recommended Approach | Rationale | Use Case Examples |
| --- | --- | --- | --- |
| Standard Clustering & Annotation | Traditional Methods (HVG + Harmony/scVI) | Established reliability, lower computational cost, interpretable results [4] | Initial cell type identification, standard atlas construction |
| Complex Biological Predictions | scFMs with Fine-tuning | Superior capture of biological relationships, transfer learning capabilities [16] | Drug response prediction, cancer cell identification, developmental trajectories |
| Exploratory Analysis (Unknown Cell Types) | Traditional Methods (Zero-shot) | More reliable zero-shot performance when ground truth is unavailable [4] | Novel cell type discovery, rare cell population identification |
| Batch Integration | Harmony or scVI | Consistent performance across diverse datasets and batch effects [4] | Multi-dataset integration, cross-study comparisons |
| Knowledge-Driven Discovery | scFMs | Better alignment with established biological hierarchies and ontologies [16] | Cell lineage relationships, regulatory network inference |

Resource Considerations

Implementation complexity varies significantly between approaches, impacting their practical feasibility:

  • Computational Resources: scFMs require substantial GPU memory and processing power for both training and inference, whereas traditional methods can typically run on standard workstations or high-performance CPU clusters [16] [51].
  • Expertise Requirements: Traditional methods have more established best practices and interpretable parameters, while scFMs require specialized knowledge in deep learning and transformer architectures [1] [51].
  • Time Constraints: For rapid prototyping and analysis, traditional methods offer faster turnaround times, while scFMs, particularly those requiring fine-tuning, involve more extensive experimental pipelines [16].

Experimental Protocols

Standardized Evaluation Protocol

To ensure fair comparison between approaches, implement this standardized evaluation protocol:

  • Data Partitioning: Split data into training (70%), validation (15%), and test (15%) sets, ensuring balanced representation of biological conditions and batches [16].
  • Baseline Establishment: Implement traditional methods (HVG selection + Harmony/scVI) as performance baselines using standardized parameters [4].
  • scFM Configuration: For scFMs, extract zero-shot embeddings first, then evaluate fine-tuned performance with limited epoch training (3-5 epochs) [16] [4].
  • Metric Calculation: Apply multiple evaluation metrics including clustering quality (AvgBIO, ASW), batch correction (PCR), and biological consistency (scGraph-OntoRWR) [16] [4].
  • Resource Monitoring: Track computational time, memory usage, and hardware requirements for each approach [51].
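The data-partitioning step above can be sketched in Python. This is a minimal illustration, not part of the cited protocols: the function name is ours, and it stratifies on cell-type label only (stratifying jointly on label and batch would be a further refinement).

```python
import numpy as np
from sklearn.model_selection import train_test_split

def partition_indices(labels, seed=0):
    """70/15/15 train/validation/test split, stratified by cell-type label.
    Stratifying jointly on label and batch is a further refinement."""
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    # hold out 30% first, then split it evenly into validation and test
    train_idx, hold_idx = train_test_split(
        idx, test_size=0.30, stratify=labels, random_state=seed)
    val_idx, test_idx = train_test_split(
        hold_idx, test_size=0.50, stratify=labels[hold_idx], random_state=seed)
    return train_idx, val_idx, test_idx
```

Stratification keeps the proportions of each cell type roughly constant across the three partitions, which matters when rare populations are present.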

Implementation Workflow for scFMs

The following diagram outlines a standardized workflow for implementing and evaluating scFMs:

scFM implementation workflow: Input scRNA-seq Data → Data Preprocessing & Quality Control → Tokenization (Gene Ranking & Value Embedding) → Select scFM Architecture (Geneformer, scGPT, etc.) → Zero-Shot Evaluation → Task-Specific Fine-Tuning → Biological Interpretation & Validation.

The Scientist's Toolkit

Successful implementation of single-cell analysis requires both wet-lab reagents and computational resources:

Table 3: Essential resources for single-cell analysis workflows

| Resource Category | Specific Tools/Reagents | Function/Purpose | Implementation Notes |
| --- | --- | --- | --- |
| Wet-Lab Reagents | 10x Genomics Chromium System | High-throughput single-cell capture and barcoding [48] | Enables processing of thousands to millions of cells |
| Wet-Lab Reagents | Smart-seq2/Smart-seq3 Reagents | Full-length transcript coverage for alternative splicing analysis [48] [49] | Lower throughput but superior transcript characterization |
| Wet-Lab Reagents | Unique Molecular Identifiers (UMIs) | Molecular counting and PCR bias correction [48] [49] | Critical for accurate quantification; typically 4-8 bp sequences |
| Computational Tools | Scanpy, Seurat | Standard pipelines for traditional single-cell analysis [4] [49] | Python/R environments respectively |
| Computational Tools | Harmony, scVI | Batch effect correction and data integration [4] | Essential for multi-dataset analyses |
| Computational Tools | Geneformer, scGPT | Foundation model architectures for transfer learning [16] [4] | Pretrained models available with specific tokenization schemes |
| Data Resources | CELLxGENE, Human Cell Atlas | Curated single-cell data for pretraining and benchmarking [1] | Contains >100 million cells across tissues and conditions |

The choice between single-cell foundation models and traditional methods represents a strategic decision that should be guided by specific research questions, available resources, and task requirements. Traditional methods remain robust, efficient solutions for standard analytical tasks, particularly in zero-shot scenarios and resource-constrained environments. In contrast, scFMs offer exciting potential for uncovering novel biological insights, especially in complex prediction tasks where their transfer learning capabilities and knowledge capture provide distinct advantages. As the field evolves, the most effective approach will likely involve thoughtful integration of both paradigms, leveraging their complementary strengths to advance single-cell research and therapeutic development.

Rigorous Benchmarking and Comparative Analysis of Model Capabilities

Zero-shot evaluation represents a critical testing ground for single-cell foundation models (scFMs). Unlike fine-tuning, where models are further trained on specific tasks, zero-shot assessment requires models to perform tasks immediately after pretraining, using their learned representations without any additional task-specific training [4]. This approach is vital for biological discovery settings where predefined labels are unavailable, and it provides a rigorous test of whether a model has genuinely learned fundamental biological principles [4] [3]. Recent evaluations have revealed that scFMs often underperform compared to simpler traditional methods in zero-shot settings, highlighting an urgent need for standardized, robust benchmarking practices [4] [5] [3]. This document establishes comprehensive application notes and protocols for zero-shot evaluation of scFMs, providing the research community with standardized datasets, metrics, and experimental frameworks.

Critical Datasets for Benchmarking

A robust zero-shot benchmark requires diverse datasets that represent various biological conditions, technologies, and challenges. The table below summarizes essential characteristics of key benchmarking datasets identified from recent evaluations.

Table 1: Essential Datasets for Zero-Shot scFM Benchmarking

| Dataset Name | Tissue/Origin | Key Characteristics | Cell Count (Approx.) | Notable Features for Evaluation |
| --- | --- | --- | --- | --- |
| Pancreas [4] [16] | Pancreas | Multiple experimental techniques | Varies | Significant batch effects between techniques |
| PBMC (12k) [4] | Peripheral blood mononuclear cells | Standardized immune cell profiling | ~12,000 | Technical variation across experiments |
| Tabula Sapiens [4] [16] | Multiple tissues | Multiple organ systems | ~600,000 | Cross-tissue heterogeneity |
| Immune Cell Atlas [4] | Immune cells | Diverse immune populations | Varies | Biological and technical variation |
| AIDA v2 [16] | Multiple tissues | Asian immune diversity | Varies | Independent, unbiased validation |
| Cancer datasets [16] | Multiple cancer types | Clinical relevance | Varies | Intra-tumor heterogeneity |

These datasets collectively provide the variation necessary to stress-test scFMs. The Pancreas dataset is particularly valuable for evaluating batch integration capabilities, as it contains data generated using different experimental techniques [4]. Tabula Sapiens offers cross-tissue complexity, while immune cell datasets capture diverse cell states. The inclusion of cancer datasets enables assessment of clinical relevance, and AIDA v2 serves as a completely independent validation set to mitigate risks of data leakage from pretraining corpora [16].

When constructing benchmarks, researchers should consider the potential overlap between evaluation datasets and those used in model pretraining. Some studies have found that scFMs do not consistently outperform baselines even on datasets seen during pretraining, suggesting limitations in how well the pretraining objective aligns with downstream zero-shot tasks [4].

Key Evaluation Metrics and Their Interpretation

Comprehensive zero-shot evaluation requires multiple metrics that capture different aspects of model performance. The following table organizes the essential metrics for scFM evaluation.

Table 2: Key Metrics for Zero-Shot scFM Evaluation

| Metric Category | Specific Metrics | Interpretation and Biological Relevance |
| --- | --- | --- |
| Cell Type Clustering | Average BIO (AvgBIO) Score [4], Average Silhouette Width (ASW) [4] | Measures separation of known cell types in embedding space; higher values indicate better biological relevance |
| Batch Integration | Principal Component Regression (PCR) Score [4], Batch Mixing Scores [4] | Quantifies removal of technical artifacts while preserving biological variation; lower PCR indicates better integration |
| Biological Plausibility | scGraph-OntoRWR [16], Lowest Common Ancestor Distance (LCAD) [16] | Measures consistency with established biological knowledge from cell ontologies |
| Perturbation Prediction | Perturbation Effect Scores [52] | Assesses prediction accuracy of cellular responses to genetic or chemical perturbations |
| Landscape Analysis | Roughness Index (ROGI) [16] | Quantifies smoothness of cell-property landscape in latent space; smoother landscapes facilitate downstream task learning |

The scGraph-OntoRWR metric represents a significant advancement in evaluating biological relevance. It measures the consistency between cell-type relationships captured by scFM embeddings and established biological knowledge in cell ontologies, providing a knowledge-aware assessment beyond purely statistical measures [16]. Similarly, LCAD evaluates the severity of cell type misannotation by measuring the ontological proximity between misclassified cell types, recognizing that not all annotation errors are equally serious [16].

For perturbation prediction, specialized benchmarks like PertEval-scFM provide standardized frameworks for assessing how well zero-shot embeddings capture information about cellular responses to genetic and chemical perturbations [52]. Performance in this area is particularly important for drug discovery applications.

Experimental Protocols for Zero-Shot Evaluation

Core Zero-Shot Evaluation Workflow

The following diagram illustrates the standardized workflow for zero-shot evaluation of single-cell foundation models:

Zero-shot evaluation workflow: Start Evaluation → Load Pretrained scFM → Input Benchmark Dataset → Generate Cell Embeddings (Zero-Shot) → evaluate, in parallel, Cell Type Clustering, Batch Integration, and Biological Plausibility → Compare Against Baseline Methods → Generate Benchmark Report.

Zero-Shot scFM Evaluation Workflow

Protocol 1: Cell Type Clustering Evaluation

Purpose: To assess the ability of scFM embeddings to separate known cell types without additional training.

Materials:

  • Pretrained scFM (e.g., scGPT, Geneformer, UCE, scFoundation)
  • Benchmark dataset with ground truth cell type labels
  • Baseline methods (HVG selection, Harmony, scVI)
  • Computing environment with adequate GPU resources

Procedure:

  1. Data Preparation: Standardize the input dataset using the scFM's predefined preprocessing pipeline. Ensure no dataset-specific normalization is applied that could constitute implicit fine-tuning.
  2. Embedding Generation: Pass each cell through the scFM in inference mode to extract the cell embeddings. For transformer models, this is typically the [CLS] token embedding or the mean of all token embeddings.
  3. Dimensionality Reduction: Apply Uniform Manifold Approximation and Projection (UMAP) to reduce embeddings to two dimensions for visualization.
  4. Clustering Analysis: Perform Leiden clustering on the embeddings without using ground truth labels.
  5. Metric Calculation: Compute clustering metrics including:
     • Average BIO score (AvgBIO) to measure cell type separation
     • Average silhouette width (ASW) for cluster compactness
     • Adjusted Rand Index (ARI) for similarity to ground truth
  6. Baseline Comparison: Repeat steps 2-5 with baseline methods including highly variable gene (HVG) selection, Harmony, and scVI.
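The clustering and metric-calculation steps can be approximated with a dependency-light sketch. The function name is ours, and k-means stands in for Leiden (which would normally be run via scanpy) purely to keep the example self-contained:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score, silhouette_score)

def clustering_metrics(embeddings, true_labels, n_clusters):
    """Cluster zero-shot embeddings and score them against ground-truth labels.
    k-means stands in for Leiden (normally run via scanpy) in this sketch."""
    pred = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(embeddings)
    return {
        "ARI": adjusted_rand_score(true_labels, pred),
        "NMI": normalized_mutual_info_score(true_labels, pred),
        "ASW": silhouette_score(embeddings, true_labels),  # cell-type silhouette
    }
```

Running the same function on scFM embeddings and on each baseline's embeddings yields directly comparable scores.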

Interpretation: Superior scFM performance should demonstrate consistently high scores across multiple datasets and metrics. Current evidence suggests that HVG selection often outperforms scFMs in zero-shot settings, providing a critical baseline for comparison [4].

Protocol 2: Batch Integration Assessment

Purpose: To evaluate how well scFM embeddings remove technical batch effects while preserving biological variation.

Materials:

  • Benchmark dataset with significant batch effects (e.g., Pancreas dataset with multiple experimental techniques)
  • Same baseline methods as Protocol 1
  • Batch integration metrics suite

Procedure:

  • Dataset Selection: Choose a dataset with pronounced batch effects from multiple sources or technologies.
  • Embedding Generation: Generate cell embeddings using the scFM as in Protocol 1.
  • Visual Assessment: Create UMAP plots colored by batch and cell type to qualitatively assess integration.
  • Quantitative Metrics: Calculate:
    • Principal Component Regression (PCR) score: proportion of variance explained by batch
    • Batch mixing scores: how well batches are intermixed within cell types
    • Biological conservation scores: preservation of cell type separation after integration
  • Comparative Analysis: Compare results against Harmony and scVI, which represent state-of-the-art batch integration methods.
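The PCR score described above can be sketched minimally as the variance-weighted R² of batch identity regressed on principal components. This is a simplified reading of the metric, not the reference implementation from the cited benchmarks, and the function name is ours:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pcr_batch_score(embeddings, batch_labels, n_pcs=20):
    """Variance-weighted R^2 of batch identity regressed on principal components.
    Higher values indicate stronger residual batch effects in the embedding."""
    n_pcs = min(n_pcs, embeddings.shape[1])
    pca = PCA(n_components=n_pcs).fit(embeddings)
    pcs = pca.transform(embeddings)
    _, inv = np.unique(batch_labels, return_inverse=True)
    onehot = np.eye(inv.max() + 1)[inv]          # one-hot batch design matrix
    r2 = np.array([LinearRegression().fit(onehot, pcs[:, i])
                   .score(onehot, pcs[:, i]) for i in range(n_pcs)])
    w = pca.explained_variance_ratio_
    return float(np.sum(r2 * w) / np.sum(w))
```

A well-integrated embedding should score near zero; an embedding dominated by batch structure will score close to one.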

Interpretation: Effective batch integration should show low PCR scores (minimal batch effect) while maintaining clear separation of biologically distinct cell types. Current evaluations indicate that scFMs often struggle with batch integration, sometimes showing higher batch effects than the original data [4].

Protocol 3: Biological Plausibility Evaluation

Purpose: To assess whether scFM embeddings capture biologically meaningful relationships consistent with established knowledge.

Materials:

  • Cell ontology resources (e.g., Cell Ontology)
  • scGraph-OntoRWR implementation [16]
  • Gene ontology databases

Procedure:

  • Embedding Generation: Generate cell embeddings as in previous protocols.
  • Cell-Type Relationship Mapping: Calculate pairwise distances between cell-type centroids in the embedding space.
  • Ontology-Based Random Walk: Implement the scGraph-OntoRWR algorithm:
    • Construct a graph from cell ontology relationships
    • Perform random walks with restarts from each cell type
    • Measure correlation between ontology-based and embedding-based similarity
  • LCAD Calculation: For cell type annotation tasks, compute the Lowest Common Ancestor Distance between misclassified cell types to assess semantic severity of errors.
  • Gene Function Analysis: Evaluate gene embeddings for functional coherence using gene ontology enrichment.
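A heavily simplified stand-in for the scGraph-OntoRWR comparison might look like the following. It assumes both the embedding-derived graph and the ontology are supplied as dense adjacency matrices over the same set of cell-type nodes; the actual metric in [16] differs in its construction details:

```python
import numpy as np
from scipy.stats import spearmanr

def rwr(adj, restart=0.3, n_iter=200):
    """Random walk with restart from every node of a connected graph;
    returns an n x n matrix of visitation probabilities."""
    W = adj / adj.sum(axis=0, keepdims=True)   # column-stochastic transitions
    n = adj.shape[0]
    P = np.eye(n)
    for _ in range(n_iter):
        P = (1 - restart) * W @ P + restart * np.eye(n)
    return P

def ontology_consistency(emb_adj, onto_adj):
    """Spearman correlation between RWR profiles of the embedding-derived
    graph and the ontology graph (same cell-type nodes in both)."""
    rho, _ = spearmanr(rwr(emb_adj).ravel(), rwr(onto_adj).ravel())
    return rho
```

A correlation near 1 means the embedding's implicit cell-type neighborhood structure mirrors the ontology's; lower values flag biologically implausible geometry.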

Interpretation: High scGraph-OntoRWR scores indicate that the embedding space reflects established biological knowledge. The LCAD metric provides nuanced evaluation of annotation errors, recognizing that confusing closely related cell types is less severe than confusing distantly related ones [16].

The Scientist's Toolkit: Essential Research Reagents

Table 3: Essential Research Reagents and Computational Tools for scFM Benchmarking

| Tool/Resource | Type | Function in Evaluation | Access Information |
| --- | --- | --- | --- |
| CZ CELLxGENE [1] | Data Platform | Provides standardized access to millions of single-cell datasets | Publicly available at cellxgene.cziscience.com |
| Geneformer [4] [16] | scFM | Transformer-based model for single-cell analysis | Available through Hugging Face |
| scGPT [4] [16] | scFM | Generative pretrained transformer for single-cell data | GitHub repository |
| Harmony [4] [16] | Integration Method | Baseline for batch integration evaluation | R/Python packages |
| scVI [4] [16] | Generative Model | Baseline for probabilistic modeling of scRNA-seq data | Python package |
| PertEval-scFM [52] | Benchmark Framework | Specialized evaluation of perturbation prediction | GitHub repository |
| AIDA v2 [16] | Benchmark Dataset | Independent validation dataset for unbiased evaluation | Available through CELLxGENE |

Analysis and Future Directions

Current zero-shot evaluations reveal significant limitations in scFMs. Multiple studies have demonstrated that these models often fail to outperform simpler baselines across various tasks, including cell type clustering and batch integration [4] [5] [3]. The masked language model pretraining objective, while successful in NLP, may not be optimally aligned with biological learning for single-cell data [3]. Furthermore, models show inconsistent performance even on datasets included in their pretraining corpora, suggesting fundamental limitations in how they capture and retain biological information [4].

The relationship between pretraining dataset scale and model performance appears complex. While some evidence suggests that increased pretraining data confers benefits, there may be diminishing returns, with larger datasets not necessarily translating to better zero-shot capabilities [4]. This highlights the need for improved pretraining strategies rather than simply scaling dataset size.

Future benchmark development should prioritize several key areas: First, creating more challenging evaluation tasks that require deeper biological reasoning, such as predicting cellular responses to novel perturbations [52] [31]. Second, developing better metrics that directly measure biological insight rather than just statistical patterns. Third, establishing rigorous standards to prevent data leakage between pretraining and evaluation sets. Finally, creating more nuanced evaluations that consider the practical contexts in which scFMs will be deployed, particularly for clinical and drug discovery applications [16] [31].

As the field matures, benchmarks must evolve beyond simple performance comparisons to provide diagnostic insights into why models succeed or fail. This will require closer integration of biological expertise in benchmark design and interpretation, ensuring that evaluations measure not just statistical patterns but meaningful biological understanding.

Single-cell RNA sequencing (scRNA-seq) has fundamentally transformed biological research by enabling the characterization of gene expression at the level of individual cells. A cornerstone of scRNA-seq analysis is cell type clustering, the process of grouping cells based on transcriptional similarity to identify distinct cellular populations. The emergence of single-cell foundation models (scFMs)—large-scale deep learning models pretrained on millions of cells—promises a new paradigm for this task. These models, including Geneformer and scGPT, are designed to learn universal biological principles from vast data corpora, which can then be applied to various downstream analyses, ideally without additional task-specific training (a "zero-shot" setting) [1].

This application note provides a structured, evidence-based comparison of these novel scFMs against established traditional methods for cell type clustering. We focus on a zero-shot evaluation framework, which is critical for discovery-driven research where cell type labels are unknown and fine-tuning is impractical [4]. We synthesize findings from recent, rigorous benchmarks to guide researchers and drug development professionals in selecting the most effective and reliable methods for their specific experimental contexts.

Quantitative Performance Comparison

Recent comprehensive benchmarking studies have evaluated the performance of scFMs against traditional methods on multiple datasets with known cell type labels. Performance is typically measured using clustering metrics like the Average BIO score (AvgBio) and Average Silhouette Width (ASW), which assess how well the clusters match the true biological labels.

Table 1: Zero-shot Cell Type Clustering Performance (AvgBio Score) [4]

| Method Category | Specific Method | PBMC (12k) | Tabula Sapiens | Pancreas | Immune Dataset |
| --- | --- | --- | --- | --- | --- |
| Single-cell Foundation Models (scFMs) | Geneformer | Underperforms baselines | Underperforms baselines | Underperforms baselines | Underperforms baselines |
| Single-cell Foundation Models (scFMs) | scGPT | Comparable to scVI | Underperforms HVG/scVI | Underperforms HVG/scVI | Underperforms HVG/scVI |
| Traditional Methods | HVG (Selection) | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs |
| Traditional Methods | Harmony | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs |
| Traditional Methods | scVI | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs |

A key finding across multiple studies is that in a zero-shot setting, traditional methods consistently match or surpass the performance of scFMs on cell type clustering. Notably, a simple baseline method like selecting Highly Variable Genes (HVG) often outperforms both Geneformer and scGPT [4] [3]. More advanced traditional methods, such as the deep learning-based scVI and the linear transformation-based Harmony, also demonstrate superior and more reliable clustering accuracy across diverse tissues and technologies [4].

Table 2: Overall Method Characteristics for Cell Type Clustering [16] [4] [53]

| Method | Clustering Accuracy (Zero-shot) | Batch Integration | Computational Efficiency | Interpretability | Ideal Use Case |
| --- | --- | --- | --- | --- | --- |
| Geneformer | Limited | Poor | Moderate | Low | Tasks requiring fine-tuning |
| scGPT | Variable | Moderate | High resource demands | Low | Exploratory analysis on similar data |
| HVG Selection | Good | Limited | Very high | High | Fast initial analysis on well-standardized data |
| Harmony | Good | Excellent | High | Medium | Integrating multiple datasets with strong batch effects |
| scVI | Good | Excellent | Moderate (requires GPU) | Medium | Large-scale data integration; downstream generative tasks |

Experimental Protocols for Performance Benchmarking

To ensure reproducible and fair comparisons between methods, researchers should adhere to standardized benchmarking protocols. The following section outlines the experimental workflow and detailed methodologies used in the cited studies.

The following diagram illustrates the standard workflow for benchmarking single-cell clustering methods, from data input to performance evaluation.

Benchmarking workflow: Input scRNA-seq Dataset (Count Matrix & Metadata) → Data Preprocessing (QC, Normalization, HVG Selection) → Method Application & Embedding Generation, using either scFMs in zero-shot mode (Geneformer, scGPT) or traditional methods (HVG Selection, Harmony, scVI) → Clustering Algorithm (e.g., Leiden, k-means) → Performance Evaluation (metrics: ARI, NMI, ASW, Bio Score).

Protocol 1: Zero-shot Clustering with Precomputed Embeddings

This protocol evaluates the intrinsic quality of cell representations generated by models without any task-specific training [4].

  • Input Data: A processed scRNA-seq dataset with ground truth cell type labels.
  • Feature Extraction:
    • For scFMs (Geneformer/scGPT): Generate cell embeddings in a zero-shot manner using the publicly available pretrained models without any fine-tuning on the target dataset.
    • For Traditional Methods:
      • HVG Selection: Reduce the dataset to the top 2,000 highly variable genes.
      • Harmony: Apply Harmony to the principal components (PCs) of the gene expression matrix to obtain integrated PCs.
      • scVI: Train a scVI model on the raw count data and extract the latent representation.
  • Clustering: Apply a standard clustering algorithm (e.g., Leiden, k-means) to the embeddings from each method. Use fixed hyperparameters (like resolution) across all methods for a fair comparison.
  • Evaluation: Compare the clustering results to the ground truth labels using metrics like Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Average BIO score (AvgBio) [4] [54].
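The traditional-method branch of the feature-extraction step can be sketched without scanpy. The function name is illustrative; the normalization target (10,000 counts per cell) and variance-based HVG ranking are common defaults rather than the exact pipeline of the cited studies, whose HVG selection is more refined:

```python
import numpy as np
from sklearn.decomposition import PCA

def hvg_pca_embedding(counts, n_hvg=2000, n_pcs=50):
    """Baseline embedding: library-size normalise, log1p, keep the most
    variable genes, then PCA. A dependency-light stand-in for the
    scanpy HVG + PCA pipeline."""
    lib = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    logged = np.log1p(counts / lib * 1e4)      # normalise to 10k counts per cell
    order = np.argsort(logged.var(axis=0))[::-1]
    reduced = logged[:, order[:min(n_hvg, counts.shape[1])]]
    n_pcs = min(n_pcs, reduced.shape[0], reduced.shape[1])
    return PCA(n_components=n_pcs).fit_transform(reduced)
```

The resulting PC matrix is what Harmony would subsequently adjust for batch, and it feeds the same clustering step used for scFM embeddings.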

Protocol 2: Benchmarking Batch Integration Capability

This protocol assesses a model's ability to mix cells from different batches while preserving biological distinctness, a key challenge in single-cell analysis [4] [55].

  • Input Data: Select a benchmark dataset with known, strong batch effects (e.g., the Pancreas dataset with five different technologies).
  • Embedding Generation: Generate cell embeddings using all methods (as in Protocol 1).
  • Visualization & Quantitative Metrics:
    • Generate UMAP plots colored by batch and by cell type.
    • Calculate quantitative integration metrics:
      • iLISI (Integration LISI): Measures the effective number of datasets/batches in a local neighborhood. A higher score indicates better batch mixing [56] [55].
      • cLISI (Cell-type LISI): Measures the effective number of cell types in a local neighborhood. A score close to 1 indicates good separation of cell types [56].
      • Principal Component Regression (PCR): Quantifies the proportion of variance in the embeddings explained by batch after correcting for cell type.
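A simplified, unweighted version of the LISI computation can be sketched as follows; the published metric additionally uses perplexity-calibrated Gaussian neighborhood weights, which this sketch omits, and the function name is ours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lisi(embeddings, labels, k=30):
    """Mean inverse Simpson index over k-NN neighbourhoods. With batch labels
    this behaves like iLISI (1 = unmixed, n_batches = fully mixed); with
    cell-type labels, like cLISI (closer to 1 = better separation)."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(embeddings).kneighbors(embeddings)
    labels = np.asarray(labels)
    scores = []
    for neigh in idx:
        _, counts = np.unique(labels[neigh], return_counts=True)
        p = counts / counts.sum()
        scores.append(1.0 / np.sum(p ** 2))  # inverse Simpson index
    return float(np.mean(scores))
```

Calling the same function with batch labels and with cell-type labels yields the iLISI-style and cLISI-style scores described above.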

Successful execution of the benchmarking protocols requires a suite of computational tools and data resources. The table below details key solutions used in the featured studies.

Table 3: Key Research Reagent Solutions for Single-Cell Clustering Benchmarking

| Category | Item / Software | Function / Description | Key Features |
| --- | --- | --- | --- |
| Foundation Models | Geneformer [16] [4] | Transformer model pretrained on 30M cells; uses gene ranking for tokenization. | Emergent network insights; fine-tuning for target tasks. |
| Foundation Models | scGPT [16] [4] [1] | Transformer model pretrained on 33M cells; supports multi-omics. | Generative capabilities; cell-centric pretraining. |
| Traditional Methods | Harmony [4] [56] [55] | Fast, iterative integration algorithm for removing batch effects. | High speed and low memory use; operates on PCs. |
| Traditional Methods | scVI [4] [53] [55] | Deep generative model for scRNA-seq data based on variational autoencoders. | Probabilistic modeling; handles raw counts. |
| Traditional Methods | HVG Selection [4] [53] | Basic feature selection to retain most variable genes. | Simple, fast, and highly effective baseline. |
| Data Resources | CELLxGENE [16] [1] | Curated atlas of single-cell data. | Source of standardized datasets for training/evaluation. |
| Data Resources | AIDA v2 [16] | Asian Immune Diversity Atlas; used for unbiased validation. | Independent dataset to mitigate data leakage risks. |
| Evaluation Metrics | LISI (iLISI/cLISI) [56] [55] | Metrics for evaluating batch mixing and cell type separation. | Local assessment of integration quality. |
| Evaluation Metrics | ARI / NMI [54] | Metrics comparing clustering result to ground truth labels. | Standard measures for clustering accuracy. |

The evidence demonstrates that there is no single "best" method universally superior for all clustering scenarios. The choice depends on the specific research context, goals, and constraints. The following decision diagram synthesizes the benchmark findings into a practical guide for method selection.

Decision guide for cell type clustering method selection:

  • Q1: Is this an exploratory analysis with no labeled data for fine-tuning? If yes (zero-shot), use traditional methods (HVG, Harmony, scVI) and proceed to Q2; if no (fine-tuning is possible), go to Q4.
  • Q2: Is the primary challenge integrating multiple datasets with strong batch effects? If yes, use Harmony or scVI, which excel at batch integration while preserving biological variance; if no, go to Q3.
  • Q3: Are computational speed and simplicity top priorities? If yes, use HVG selection, which is fast, interpretable, and often outperforms complex scFMs; if no, use Harmony or scVI.
  • Q4: Is the goal to discover novel biology beyond predefined cell types? If yes, consider scFMs with caution: they offer potential for novel insights, but validate findings against other methods.

Current evidence indicates that for the critical task of zero-shot cell type clustering, traditional methods like Harmony, scVI, and even simple HVG selection provide more robust, accurate, and computationally efficient results than the current generation of single-cell foundation models [4] [3]. While scFMs represent a promising architectural advance and may excel in other tasks like perturbation prediction [18] or when fine-tuned, their zero-shot embeddings do not yet consistently capture biological reality for clustering as effectively as established techniques.

Therefore, for researchers and drug development professionals, the recommended practice is to use traditional methods as the primary tool for cell type discovery and atlas construction. scFMs should be approached as emerging technologies; their results should be rigorously validated against traditional method outputs and biological priors. Future developments in model architecture, pretraining objectives, and data curation are needed to close this performance gap and realize the full potential of foundation models in single-cell biology [16] [1].

The emergence of single-cell foundation models (scFMs) represents a paradigm shift in computational biology, offering unprecedented capacity to analyze cellular heterogeneity and function. However, a critical challenge persists: how to rigorously evaluate whether these models capture biologically meaningful patterns beyond mere technical performance on computational tasks. Traditional metrics for clustering accuracy or batch integration often fail to assess the biological relevance of learned representations. This gap has prompted the development of novel ontology-informed evaluation metrics, particularly scGraph-OntoRWR, which quantifies the alignment between computational model outputs and established biological knowledge [16] [57]. These metrics introduce a crucial biological ground truth into model assessment, enabling researchers to determine whether scFMs truly understand cellular biology or merely excel at pattern recognition without semantic understanding.

The integration of biological ontologies provides the formal scaffolding necessary for this evaluation approach. Biological ontologies are structured, controlled vocabularies that capture hierarchical relationships between biological entities—from genes and proteins to cell types and physiological processes [57]. By leveraging these comprehensive knowledge structures, researchers can now quantitatively measure how well the relational patterns discovered by scFMs correspond to biologically verified relationships. This approach is particularly valuable for evaluating zero-shot learning capabilities in scFMs, where models must generalize to novel datasets without task-specific fine-tuning [4].

Biological Ontologies: The Framework for Knowledge Representation

Foundations of Biological Knowledge Representation

Biological ontologies provide a formal, explicit specification of shared conceptualizations within the biological domain, capturing not just definitions but the intricate logical relationships between biological concepts [57]. Unlike simple databases or glossaries, ontologies structure knowledge through standardized relationship types such as "is_a" (denoting classification hierarchies), "part_of" (representing mereological relationships), and "participates_in" (connecting entities to processes). The Open Biological and Biomedical Ontology (OBO) Foundry represents a major community effort to coordinate ontology development across biological sciences, establishing best practices and standardized relationship definitions to ensure interoperability and logical consistency [57].

Two fundamental concept types form the bedrock of most biological ontologies. Continuants are entities that persist through time while maintaining their identity, such as molecules, cells, tissues, and organs. Occurrents are time-dependent entities including processes, actions, and states—for example, biochemical reactions, cell division, or disease progression [57]. This distinction is crucial for proper knowledge representation, as it helps avoid common modeling errors, such as confusing a physical structure with the processes it participates in.

Ontologies in Single-Cell Biology

In single-cell biology, ontologies provide essential organization for the extremely complex and high-dimensional data generated by technologies like scRNA-seq. Cell ontologies specifically define cell types and their relationships in a standardized hierarchy, capturing developmental lineages and functional classifications [57]. For example, a cell ontology would specify that a "cardiac muscle cell" is a subtype of "muscle cell," which in turn is a subtype of "animal cell," while also representing that it is "part_of" the heart and "participates_in" muscle contraction processes.

These structured relationships provide the biological ground truth against which computational models can be evaluated. When a model represents two cell types as similar, ontology-based metrics can determine whether this computational similarity reflects established biological relationships—such as developmental lineage or functional similarity—or represents biologically nonsensical associations [16].

Novel Metrics for Evaluating Biological Insight

The scGraph-OntoRWR Metric

The scGraph-OntoRWR (Single-Cell Graph-Ontology Random Walk with Restart) metric represents a significant advancement in evaluating the biological relevance of scFM embeddings [16]. This innovative metric operates by comparing the relational structure between cell types learned by computational models against the known hierarchical structure encoded in biological ontologies.

The metric employs a random walk with restart algorithm on a cell-cell similarity graph constructed from model embeddings. This algorithm simulates a random traverser that moves between similar cells in the computational embedding space, with occasional restarts to maintain locality. The resulting visitation probabilities capture the implicit relational structure that the model has learned between different cell types [16].

Simultaneously, the same random walk process is applied to the formal cell ontology, where relationships are biologically validated and semantically meaningful. By comparing the probability distributions generated from the computational embeddings against those from the formal ontology, scGraph-OntoRWR quantifies the consistency between model-derived cell relationships and established biological knowledge [16]. A high scGraph-OntoRWR score indicates that the computational model has learned to represent cell types in a manner that respects their known biological relationships, suggesting genuine biological insight rather than merely technical pattern recognition.
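The core random-walk step can be sketched in a few lines of NumPy. This is a minimal illustration on a generic weighted graph, not the authors' implementation; the default restart probability r = 0.3 and convergence threshold match the values given in the protocol later in this section, and the code assumes every node has at least one edge:

```python
import numpy as np

def rwr(adjacency, seed, restart=0.3, tol=1e-6, max_iter=10000):
    """Random walk with restart on a weighted graph.

    adjacency : (n, n) nonnegative edge-weight matrix (every node needs an edge)
    seed      : (n,) nonnegative restart distribution (e.g., 1 on seed cells)
    Returns the stationary visitation probabilities.
    """
    # Column-normalize edge weights into a column-stochastic transition matrix.
    W = adjacency / adjacency.sum(axis=0, keepdims=True)
    p0 = seed / seed.sum()
    p = p0.copy()
    for _ in range(max_iter):
        # With probability (1 - r) step along an edge; with probability r restart.
        p_next = (1 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p
```

Seeding the walk from a cell type's representative cells and averaging the resulting distributions yields the per-type visitation profiles that the metric then compares across the two graphs.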

Lowest Common Ancestor Distance (LCAD)

The Lowest Common Ancestor Distance (LCAD) metric provides a complementary approach to evaluating model errors in biologically meaningful terms [16]. Rather than treating all misclassifications equally, LCAD assesses the severity of cell type annotation errors by measuring their distance within the ontological hierarchy.

When a model misclassifies a cell type, LCAD calculates how closely related the predicted and actual cell types are within the ontology by identifying their lowest common ancestor and measuring the ontological proximity between them [16]. For example, misclassifying a "T helper cell" as a "cytotoxic T cell" represents a less severe error than misclassifying it as a "neuron," as T cell subtypes share a more recent common ancestor in the cell ontology. The former error might reflect incomplete learning of fine-grained distinctions, while the latter suggests a fundamental failure to capture major cell lineage differences.

This ontology-informed error assessment provides crucial context for model evaluation, helping researchers distinguish between biologically reasonable mistakes and nonsensical predictions [16]. By incorporating LCAD alongside traditional accuracy metrics, researchers gain a more nuanced understanding of model performance that respects biological reality.
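The error-severity calculation can be illustrated with a toy child-to-parent map. This is a hypothetical miniature ontology built for illustration; a real implementation would parse the full Cell Ontology (e.g., with pronto) and handle terms with multiple parents:

```python
def ancestors(term, parent):
    """Chain from a term up to the root in a child -> parent map."""
    chain = [term]
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain

def lcad(true_type, pred_type, parent):
    """Sum of hop distances from both terms to their lowest common ancestor."""
    depth_in_pred = {t: d for d, t in enumerate(ancestors(pred_type, parent))}
    for d_true, t in enumerate(ancestors(true_type, parent)):
        if t in depth_in_pred:
            return d_true + depth_in_pred[t]
    raise ValueError("terms share no common ancestor")

# Hypothetical toy hierarchy mirroring the example above.
parent = {
    "T helper cell": "T cell", "cytotoxic T cell": "T cell",
    "cardiac muscle cell": "muscle cell", "skeletal muscle cell": "muscle cell",
    "T cell": "animal cell", "muscle cell": "animal cell", "neuron": "animal cell",
    "animal cell": "cell",
}
```

On this toy hierarchy, `lcad("T helper cell", "cytotoxic T cell", parent)` returns 2 (LCA "T cell"), while `lcad("T helper cell", "neuron", parent)` returns 3, reflecting the more distant LCA "animal cell".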

Performance Benchmarks of Single-Cell Foundation Models

Recent comprehensive benchmarking studies have applied these novel metrics to evaluate six prominent single-cell foundation models (Geneformer, scGPT, UCE, scFoundation, LangCell, and scCello) across diverse tasks including batch integration, cell type annotation, cancer cell identification, and drug sensitivity prediction [16]. The results reveal that no single scFM consistently outperforms others across all tasks, emphasizing that model selection must be tailored to specific applications and data characteristics.

Table 1: Performance Ranking of Single-Cell Foundation Models Across Biological Tasks [16]

Model Batch Integration Cell Type Annotation Cancer ID Drug Sensitivity Overall Ranking
Geneformer 2 3 1 2 2
scGPT 3 2 3 3 3
UCE 1 4 4 4 4
scFoundation 4 1 2 1 1
Traditional ML 5 5 5 5 6
HVG Selection 6 6 6 6 5

The benchmarking demonstrated that foundation models generally show remarkable robustness and versatility across diverse applications, while simpler machine learning models sometimes adapt more efficiently to specific datasets, particularly under resource constraints [16]. Notably, the pretrained zero-shot scFM embeddings captured meaningful biological insights into the relational structure of genes and cells, which proved beneficial for downstream tasks. Performance improvements correlated with what researchers termed "a smoother landscape" in the pretrained latent space, reducing the difficulty of training task-specific models [16].

Table 2: The Scientist's Toolkit: Essential Research Reagents and Resources [57]

Reagent/Resource Function Biological Significance
Gene Embeddings Numerical representations of genes in latent space Capture functional similarities between genes based on co-expression patterns across diverse cellular contexts
Cell Ontologies Structured vocabularies defining cell types and relationships Provide ground truth for evaluating biological relevance of model outputs
Attention Mechanisms Model components that identify important relationships between inputs Reveal gene-gene interactions and regulatory relationships learned from data
Benchmark Datasets Curated single-cell data with high-quality annotations Enable standardized evaluation and comparison of different modeling approaches
GO Term Annotations Gene Ontology functional classifications Serve as biological prior knowledge for validating gene embeddings

Experimental Protocols for Knowledge Alignment Assessment

Protocol: Implementing scGraph-OntoRWR Evaluation

Objective: Quantify the alignment between cell-type relationships learned by a single-cell foundation model and established biological knowledge encoded in cell ontologies.

Materials and Reagents:

  • Pre-trained single-cell foundation model (e.g., Geneformer, scGPT, scFoundation)
  • Reference single-cell dataset with high-quality cell type annotations (e.g., from CELLxGENE)
  • Cell ontology (e.g., Cell Ontology from OBO Foundry)
  • Computational environment with Python and libraries including scanpy, scikit-learn, and ontology processing packages

Procedure:

  • Embedding Generation:
    • Process the reference dataset through the scFM in zero-shot mode to generate cell embeddings without any fine-tuning.
    • Normalize embeddings using L2 normalization to ensure comparable distance metrics.
  • Cell-Cell Graph Construction:

    • Construct a k-nearest neighbor graph (k=15) from the normalized embeddings using cosine similarity.
    • Convert the kNN graph to an adjacency matrix with edge weights representing similarity scores.
  • Ontology Graph Processing:

    • Download the current Cell Ontology in OWL format.
    • Extract the "is_a" and "part_of" relationships to construct a hierarchical graph structure.
    • Convert the ontology hierarchy to an adjacency matrix where connections represent ontological relationships.
  • Random Walk with Restart Execution:

    • Implement RWR algorithm with restart probability r=0.3 on both the embedding-derived graph and ontology graph.
    • For each cell type, initiate RWR from 10 representative seed cells.
    • Run until convergence (Δ < 1e-6 between iterations) to obtain stable probability distributions.
  • Similarity Calculation:

    • Compute Jensen-Shannon divergence between the RWR probability distributions from the model and ontology for each cell type.
    • Convert divergences to similarity scores using exponential transformation.
    • Calculate final scGraph-OntoRWR score as mean similarity across all cell types.
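The similarity calculation above can be sketched in pure NumPy. The exponential transform from divergence to similarity is one reasonable choice assumed here, not taken verbatim from the cited work:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def scgraph_ontorwr_score(model_dists, onto_dists):
    """Mean exp(-JSD) similarity across cell types; 1.0 = perfect agreement.

    model_dists, onto_dists : matched lists of RWR probability distributions,
    one pair per cell type (embedding-derived graph vs. ontology graph).
    """
    sims = [np.exp(-js_divergence(p, q)) for p, q in zip(model_dists, onto_dists)]
    return float(np.mean(sims))
```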

Validation:

  • Compare scGraph-OntoRWR scores across multiple models and datasets.
  • Perform statistical significance testing using paired t-tests across cell types.
  • Correlate scGraph-OntoRWR scores with traditional biological metrics like marker gene expression.

Protocol: Assessing Misclassification Severity with LCAD

Objective: Evaluate cell type annotation errors in ontologically meaningful terms rather than treating all errors equally.

Materials and Reagents:

  • Cell type predictions from scFM on benchmark dataset
  • Ground truth cell type annotations
  • Cell ontology (e.g., Cell Ontology)
  • Ontology processing toolkit (e.g., pronto)

Procedure:

  • Error Identification:
    • Compare model predictions against ground truth annotations to identify misclassified cells.
    • For each misclassification, record both the true cell type and predicted cell type.
  • Ontological Distance Calculation:

    • For each misclassification pair, find the lowest common ancestor (LCA) in the cell ontology.
    • Calculate the shortest path distance from both the true and predicted cell types to the LCA.
    • Sum these distances to obtain the LCAD value for each error.
  • Score Aggregation:

    • Compute mean LCAD across all misclassifications for a model.
    • Calculate distribution statistics (median, standard deviation) to understand error patterns.
    • Compare LCAD values against random baseline expectation.
  • Biological Interpretation:

    • Categorize errors by severity: low LCAD (ontologically related cell types) vs. high LCAD (distantly related cell types).
    • Identify systematic error patterns that might indicate specific model limitations.

Cell
└── Animal Cell
    ├── Muscle Cell
    │   ├── Cardiac Muscle Cell
    │   └── Skeletal Muscle Cell
    ├── Neuron
    └── T Cell
        ├── Cytotoxic T Cell
        └── T Helper Cell

Diagram Title: Cell Ontology Hierarchy for LCAD Calculation

Application Notes for Drug Discovery and Development

Enhancing Target Identification and Validation

In pharmaceutical research, scGraph-OntoRWR provides a crucial framework for evaluating whether scFMs can correctly identify and represent disease-relevant cell states. When applied to tumor microenvironment data, this metric can verify that models maintain proper distinctions between immune cell subtypes while recognizing their functional relationships [16]. This capability is particularly valuable for identifying novel therapeutic targets within complex tissues, where understanding cellular relationships is essential for predicting on-target effects and potential toxicities.

For example, when analyzing scRNA-seq data from cancer biopsies, researchers can use scGraph-OntoRWR to ensure that models correctly cluster tumor-infiltrating lymphocytes by subtype while maintaining their ontological relationship to broader immune cell classes. A model with high scGraph-OntoRWR scores would be more trustworthy for identifying rare but therapeutically relevant cell populations, such as exhausted T cells or tumor-associated macrophages in specific functional states [16].

Accelerating Drug Repurposing Through Cross-Domain Alignment

Knowledge graphs have emerged as powerful tools for drug repurposing, organizing complex relationships between drugs, targets, diseases, and side effects [58] [59]. The principles underlying scGraph-OntoRWR can be extended to evaluate how well scFMs align with these pharmacological knowledge structures, creating opportunities for drug repurposing through cross-domain knowledge alignment.

By treating drug-disease relationships as a form of ontology, researchers can adapt the scGraph-OntoRWR methodology to assess how well model representations of drug-treated cells reflect known therapeutic mechanisms. For instance, a model that correctly represents that cardiac muscle cells and neurons share distant ontological relationships would be less likely to suggest cardiotoxic compounds for neurological disorders, potentially flagging safety issues earlier in the drug discovery process [59].

Addressing Limitations and Future Directions

While scGraph-OntoRWR represents a significant advance in biological evaluation of scFMs, several limitations remain. The metric depends heavily on the completeness and accuracy of the underlying ontologies, which may have gaps for rare cell types or newly discovered biological relationships [57]. Additionally, current implementations focus primarily on cell type relationships, with less emphasis on functional states or spatial contexts.

Future developments may extend these approaches to incorporate dynamic biological processes, multi-omics integrations, and causal relationship modeling. As noted in expert opinion, "Many popular link prediction algorithms fail to address strong biases in biomedical data, and only highlight biological associations, failing to model causal relationships in complex dynamic biological systems" [58]. Addressing these limitations will further enhance the utility of ontology-informed metrics for evaluating biological insight in computational models.

Single-Cell Data → Foundation Model → Cell Embeddings → Cell-Cell Graph → RWR on Model Graph
Cell Ontology → Ontology Graph → RWR on Ontology
RWR on Model Graph + RWR on Ontology → Similarity Comparison → scGraph-OntoRWR Score

Diagram Title: scGraph-OntoRWR Calculation Workflow

The deployment of single-cell Foundation Models (scFMs) in a zero-shot setting—where models make predictions on novel data without any further task-specific training—is a critical test for their use in biological discovery. This application note synthesizes recent benchmarking studies to evaluate the zero-shot capabilities of leading scFMs, including Geneformer, scGPT, scFoundation, and LangCell. The analysis reveals that while these models hold immense promise for tasks like cell type annotation and batch integration, their zero-shot performance often fails to exceed that of simpler, established baseline methods. Performance is context-dependent, influenced by factors such as pretraining data composition and architectural choices. The following sections provide a detailed comparative analysis, standardized evaluation protocols, and actionable guidance for researchers aiming to incorporate these tools into their workflows.

Single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to study cellular heterogeneity at an unprecedented resolution. The analysis of this high-dimensional data presents significant computational challenges, spurring the development of specialized single-cell Foundation Models (scFMs). These models are pre-trained on millions of single-cell gene expression profiles with the goal of learning universal patterns in transcriptional regulation [60] [3]. A key claimed advantage of scFMs is their potential for zero-shot learning—the ability to generalize to new datasets and tasks without requiring additional training data or fine-tuning. This capability is particularly valuable in exploratory biology, where predefined labels for cell types or states may be unavailable [4]. This document assesses the zero-shot performance of several prominent scFMs, providing a framework for their practical application and evaluation.

Model Architectures and Pre-training Strategies

The foundational knowledge of an scFM is largely determined by its architecture and the data and objectives used during pre-training. The table below summarizes the key characteristics of the evaluated models.

Table 1: Architectural and Pre-training Specifications of Leading scFMs

Model Pre-training Data Input Size (Genes) Key Architectural Features Pre-training Objective(s)
Geneformer [61] 29.9M human cells (v1); 95M human cells (v2) 2,048 (v1); 4,096 (v2) Transformer; Rank-value gene encoding; Cell embedding (v1) or CLS token embedding (v2) Masked gene prediction
scGPT [60] [62] 33M non-cancerous human cells (scGPT-human) Full gene set Transformer; Employs batch and condition tokens Masked gene prediction
scFoundation [63] Information missing Information missing Transformer-based Information missing
LangCell [64] Information missing Information missing Language-Cell pre-training framework; Unified representation of single-cell data and natural language Incorporates text descriptions with discriminative and generative objectives
scMMGPT [60] 27M human cells + textual data Full gene set Multimodal (scRNA-seq + text); Bidirectional projectors; Two-stage pre-training Discriminative (cell-text alignment) and Generative (text reconstruction)

A notable trend is the move towards multimodal integration. While earlier models like Geneformer and scGPT rely solely on transcriptomic data, newer approaches like LangCell and scMMGPT explicitly incorporate textual knowledge (e.g., cell type definitions from Wikipedia and OBO Foundry) during pre-training. This aims to ground the model's representations in rich, human-curated biological semantics [60] [64]. Another key differentiator is how models handle the input data; some, like Geneformer, use a fixed subset of genes ranked by expression, whereas others, like scGPT and scMMGPT, are designed to process the full quantitative expression profile to minimize information loss [60].
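Geneformer-style rank-value encoding can be illustrated with a short sketch. This is simplified for illustration: the actual pipeline also normalizes each gene by its median expression across the pretraining corpus before ranking, a step omitted here:

```python
import numpy as np

def rank_value_encode(expression, gene_names, max_len=2048):
    """Order a cell's genes by descending expression and truncate to max_len.

    expression : sequence of counts for one cell
    gene_names : matching list of gene identifiers
    Returns the token sequence of expressed genes, highest expression first.
    """
    expression = np.asarray(expression, float)
    # Stable descending sort so ties keep their original gene order.
    order = np.argsort(-expression, kind="stable")
    tokens = [gene_names[i] for i in order if expression[i] > 0]
    return tokens[:max_len]
```

A cell is thus represented by which genes it expresses most, relative to the rest of its transcriptome, rather than by absolute counts.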

Quantitative Performance Benchmarking

Rigorous benchmarking is essential to understand the real-world utility of these models. The following tables consolidate quantitative results from recent independent evaluations, focusing on zero-shot performance for core tasks in single-cell analysis.

Zero-Shot Cell Type Clustering

Cell type clustering in a zero-shot setting involves using a model's embedding to group cells of the same type without any fine-tuning on the target dataset. Performance is measured by how well the embeddings separate known cell types.

Table 2: Zero-shot Cell Type Clustering Performance (Average BIO Score) [4] [3]

Model / Baseline Pancreas Dataset Tabula Sapiens Dataset Immune Dataset PBMC (12k) Dataset
Highly Variable Genes (HVG) 0.78 0.75 0.72 0.69
Harmony 0.75 0.71 0.70 0.67
scVI 0.74 0.73 0.68 0.68
scGPT 0.65 0.66 0.63 0.71
Geneformer 0.58 0.55 0.52 0.56
Random scGPT 0.51 0.50 0.49 0.52

Key Insights:

  • Simpler methods often outperform scFMs. The heuristic baseline of selecting Highly Variable Genes (HVG) consistently outperformed or matched all foundation models across most datasets [4] [3].
  • Performance is dataset-dependent. scGPT showed competitive performance on the PBMC dataset but lagged on others. The benchmarking also indicated that models do not consistently perform better on datasets that were part of their pre-training corpus [4].
  • Pre-training provides a marginal benefit. While pre-trained scGPT models performed better than a randomly initialized version, this improvement was not sufficient to surpass established baselines like scVI and Harmony in most cases [4].

Zero-Shot Batch Integration

Batch integration evaluates a model's ability to produce embeddings where cells of the same type cluster together, regardless of technical artifacts from different experiments or donors.
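One common formulation of a batch mixing score is the normalized entropy of batch labels within each cell's kNN neighborhood. The sketch below illustrates this family of metrics; the cited benchmark's exact score may differ:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def batch_mixing_entropy(embeddings, batch_labels, k=30):
    """Mean normalized entropy of batch composition in each cell's kNN
    neighborhood: 1.0 = batches perfectly mixed, 0.0 = fully separated."""
    batches, inv = np.unique(batch_labels, return_inverse=True)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(embeddings).kneighbors(embeddings)
    entropies = []
    for row in idx[:, 1:]:  # drop the query cell itself
        p = np.bincount(inv[row], minlength=len(batches)) / k
        p = p[p > 0]
        entropies.append(-(p * np.log(p)).sum() / np.log(len(batches)))
    return float(np.mean(entropies))
```

An embedding that removes technical batch effects places cells from different batches into shared neighborhoods, driving this score toward 1.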

Table 3: Batch Integration Performance (Batch Mixing Score) [4]

Model / Baseline Pancreas Dataset PBMC Dataset Tabula Sapiens Dataset Immune Dataset
Highly Variable Genes (HVG) 0.85 0.88 0.82 0.80
scVI 0.81 0.85 0.75 0.72
Harmony 0.78 0.80 0.70 0.78
scGPT 0.72 0.75 0.78 0.77
Geneformer 0.45 0.48 0.41 0.43

Key Insights:

  • HVG remains a strong baseline. Once again, the simple HVG method achieved the highest scores, indicating that current scFMs do not inherently learn to remove batch effects more effectively than basic feature selection in a zero-shot setting [4].
  • Geneformer struggles with batch effects. Geneformer's embeddings consistently showed a high proportion of variance explained by batch effects, performing worse than other methods [4].
  • scGPT shows variable performance. scGPT's performance was more competitive, sometimes outperforming scVI or Harmony on datasets with complex biological batch effects (e.g., different donors), though these were also datasets potentially seen during its pre-training [4].

Fine-Tuned Cell Type Annotation

While zero-shot performance is critical for discovery, fine-tuning on labeled data is a common application. The table below shows the performance of Geneformer models after fine-tuning a classifier on their embeddings.

Table 4: Fine-tuned Cell Type Annotation Performance (F1 Score) [61]

Model Cross-Tissue Immune Atlas (LVL1) CITE-seq Yolk Sac (LVL1) CITE-seq Yolk Sac (LVL3 - High Resolution)
Geneformer v1 0.72 0.81 0.21
Geneformer v2 (Base) 0.85 0.89 0.42
Geneformer v2 (Cancer-tuned) 0.86 0.90 0.43

Key Insights:

  • Architectural improvements matter. Geneformer v2, with its larger pre-training corpus, increased input size, and use of a CLS token, significantly outperforms v1, especially in fine-tuned settings [61].
  • High-resolution annotation is challenging. All models perform worse on Level 3 (finer) cell type annotations, but v2 shows a 2x improvement in F1 score over v1, demonstrating its ability to capture more nuanced biological information [61].

Detailed Experimental Protocols

This section outlines standardized protocols for reproducing key benchmarking experiments, enabling researchers to validate model performance on their own datasets.

Protocol 1: Zero-Shot Cell Type Clustering Evaluation

Objective: To assess the quality of a model's cell embeddings for separating known cell types without any fine-tuning.

Materials:

  • Processed single-cell dataset (e.g., from CellxGene) with annotated cell types.
  • Python environment with installed dependencies (see Research Reagent Solutions).

Methodology:

  • Data Preprocessing: Prepare an AnnData object containing the raw or normalized gene expression matrix, with cell type annotations stored in adata.obs.
  • Embedding Generation: Pass the preprocessed data through the target scFM in evaluation mode to extract cell embeddings.
    • For Geneformer: Use the geneformer.get_embeddings() method with emb_mode="cell" (v1) or emb_mode="cls" (v2) [61].
    • For scGPT: Use the model's forward pass to generate the cell embeddings as described in the official documentation [62].
  • Dimensionality Reduction: Apply Principal Component Analysis (PCA) to the embeddings, retaining the top 50 principal components.
  • Clustering and Evaluation: Perform k-nearest neighbors (K-NN) clustering on the PCA-reduced embeddings. Calculate the Average BIO score and Average Silhouette Width (ASW) to quantify cluster purity and separation [4].

Analysis:

  • Compare the scores against established baselines (HVG, scVI, Harmony) run on the same dataset.
  • A higher BIO score (closer to 1) indicates better alignment between clusters and ground-truth cell types.
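A minimal version of this evaluation can be assembled from scikit-learn components. Here k-means stands in for the protocol's K-NN clustering step, and ASW plus the adjusted Rand index stand in for the composite Average BIO score:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

def zero_shot_clustering_eval(embeddings, cell_types, n_pcs=50, seed=0):
    """Score zero-shot cell embeddings against ground-truth cell type labels."""
    n_pcs = min(n_pcs, embeddings.shape[1], embeddings.shape[0] - 1)
    reduced = PCA(n_components=n_pcs, random_state=seed).fit_transform(embeddings)
    k = len(set(cell_types))
    pred = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(reduced)
    return {
        "asw": silhouette_score(reduced, cell_types),  # separation of known types
        "ari": adjusted_rand_score(cell_types, pred),  # cluster/label agreement
    }
```

Running the same function on HVG-, scVI-, or Harmony-derived representations of the same dataset gives the baseline comparison the protocol calls for.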

Annotated Dataset → Data Preprocessing → Generate Cell Embeddings (Zero-shot) → Dimensionality Reduction (PCA) → K-NN Clustering → Calculate Metrics (BIO, ASW) → Compare vs. Baselines

Figure 1: Workflow for zero-shot clustering evaluation.

Protocol 2: Benchmarking Gene Expression Prediction

Objective: To evaluate a model's core understanding of gene-gene relationships by testing its ability to predict the expression of held-out genes.

Materials:

  • As in Protocol 1.

Methodology:

  • Data Splitting: For a given cell, mask the expression values of a random 10% of its genes.
  • Model Inference: Input the cell with masked genes into the model and collect its predictions for the masked values.
  • Performance Calculation: For each masked gene, calculate the Mean Absolute Error (MAE) or Mean Squared Error (MSE) between the predicted and actual expression values. Aggregate these scores across all cells and masked genes in the test set [40] [3].

Analysis:

  • A perfect predictor would have an MAE/MSE of zero. Compare the model's error against a simple baseline, such as predicting the median expression value of each gene across the training set.
  • As noted in benchmarking, some models may struggle, performing only slightly better than this median baseline, particularly for low-to-medium expressed genes [3].
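The median-baseline comparison can be sketched as follows. Here `predict_fn` is a hypothetical interface introduced for illustration; in practice it would wrap the scFM's masked-gene forward pass:

```python
import numpy as np

def masked_gene_error(expr, predict_fn, mask_frac=0.1, seed=0):
    """Mean absolute error of predictions for randomly masked genes.

    expr       : (n_cells, n_genes) expression matrix
    predict_fn : callable(cell_vector, masked_indices) -> predicted values
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = expr.shape
    n_mask = max(1, int(mask_frac * n_genes))
    errs = []
    for i in range(n_cells):
        masked = rng.choice(n_genes, size=n_mask, replace=False)
        preds = np.asarray(predict_fn(expr[i], masked), float)
        errs.append(np.abs(preds - expr[i, masked]).mean())
    return float(np.mean(errs))

def median_baseline(train_expr):
    """Predict each masked gene's median expression over the training set."""
    med = np.median(train_expr, axis=0)
    return lambda cell, masked_idx: med[masked_idx]
```

A model whose error only marginally undercuts `median_baseline` has learned little about gene-gene dependencies beyond per-gene expression levels.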

Input Cell Expression Vector → Mask 10% of Genes → Model Predicts Masked Values → Compute Error (MAE, MSE) → Compare vs. Median Baseline

Figure 2: Workflow for expression prediction benchmarking.

The Scientist's Toolkit: Research Reagent Solutions

The following table details key software and data resources required for working with scFMs.

Table 5: Essential Research Reagents for scFM Evaluation

Reagent / Resource Type Function / Application Source / Reference
CellxGene Database Data Primary source of large-scale, publicly available single-cell data for pre-training and benchmarking. https://cellxgene.cziscience.com/ [60]
scGPT Repository Software Provides code for loading pre-trained weights, generating embeddings, and fine-tuning. https://github.com/bowang-lab/scGPT [62]
Geneformer Repository Software Official implementation of Geneformer. Requires Git LFS to download model weights. https://huggingface.co/ctheodoris/Geneformer [62]
Zero-shot Evaluation Code Software Benchmarking code from Microsoft Research for reproducing cell clustering and batch integration tests. https://github.com/microsoft/zero-shot-scfoundation [62]
Helical Package Software A unified package facilitating easy access and evaluation of various bio-foundation models, including Geneformer. https://github.com/helicalAI/helical [61]
OBO Foundry / Wikipedia Data Sources of structured and free-text biological knowledge for multimodal pre-training (e.g., cell type descriptions). https://obofoundry.org/ [60]

Based on the consolidated findings from recent benchmarks, the following recommendations are provided for researchers and drug development professionals:

  • Temper Expectations for Zero-Shot Discovery: Practitioners should be cautious about using current scFMs for pure zero-shot discovery on critical tasks. Simpler methods like HVG selection, scVI, or Harmony may provide more reliable and interpretable results for tasks like initial clustering and batch correction [4] [3].
  • Validate with Baselines: Always include established baselines in any evaluation pipeline. The superior performance of simple methods in recent benchmarks underscores that model complexity and scale do not automatically translate to better zero-shot performance [63] [4].
  • Consider Fine-Tuning for Specific Tasks: If labeled data is available, fine-tuning can significantly improve performance, as evidenced by the gains seen in Geneformer v2 [61]. scFMs should currently be viewed as powerful base models for transfer learning rather than out-of-the-box discovery engines.
  • Evaluate on Multiple Metrics and Datasets: Model performance is highly dataset-dependent. A comprehensive evaluation should use multiple metrics (e.g., BIO, ASW, batch mixing scores) across diverse biological contexts to build confidence in a model's utility [40] [4].
  • Monitor Multimodal Advances: Emerging models like LangCell and scMMGPT, which integrate textual knowledge, represent a promising direction. They have shown improved performance in cell annotation and better generalization, suggesting that multimodal learning may be key to unlocking more robust zero-shot capabilities [60] [64].

In conclusion, while single-cell foundation models are a rapidly evolving and powerful new class of tools, their application in zero-shot settings requires careful validation. By adhering to standardized benchmarking protocols and maintaining a critical perspective relative to simpler methods, the research community can best leverage these models to drive meaningful biological discovery.

In the rapidly evolving field of single-cell genomics, foundation models (scFMs) pretrained on millions of cells have emerged as powerful tools for extracting biological insights from complex data. These models, including scGPT, Geneformer, and scBERT, leverage transformer architectures to learn universal representations of cellular states [1]. However, their practical application, particularly in zero-shot learning settings where models are applied without task-specific fine-tuning, requires careful consideration of the inherent trade-offs between performance, interpretability, and computational cost. This framework is essential for researchers and drug development professionals who must select appropriate models for discovery-driven research where predefined labels are often unavailable [4].

The evaluation of these trade-offs is critical because, as recent studies indicate, scFMs do not consistently outperform simpler baseline methods in zero-shot settings. In some cases, selecting highly variable genes (HVG) can surpass foundation models in tasks like cell type clustering and batch integration [4]. This application note provides a structured approach to interpreting evaluation results, enabling informed decision-making for biological discovery and therapeutic development.

Quantitative Performance Benchmarking in Zero-Shot Settings

Rigorous evaluation of scFMs against established baselines is crucial for assessing their practical utility. Performance benchmarks should encompass multiple biological and technical contexts to reveal model strengths and limitations. The following metrics and comparisons provide a standardized framework for model assessment.

Performance Metrics and Evaluation Criteria

The table below outlines key metrics for evaluating scFMs across common single-cell analysis tasks:

Task Category Specific Task Key Evaluation Metrics Interpretation Guide
Cell-level Tasks Cell Type Clustering Average BIO (AvgBIO) score, Average Silhouette Width (ASW) Higher scores indicate better separation of known cell types [4].
Batch Integration Principal Component Regression (PCR) score, Batch mixing scores Lower PCR indicates better batch effect removal; higher batch mixing scores indicate better integration [4].
Gene-level Tasks Gene Function Prediction Gene ontology enrichment, Prior knowledge alignment Measures biological relevance of gene embeddings [16].
Clinical Applications Drug Sensitivity Prediction Accuracy, AUC-ROC Model performance in predicting therapeutic responses [16].
Cancer Cell Identification F1 score, Precision-Recall Accuracy in distinguishing malignant from benign cells [16].

Comparative Performance of scFMs and Baselines

Recent benchmarking studies reveal that no single scFM consistently outperforms all others across diverse tasks. The following table summarizes the zero-shot performance of leading scFMs compared to established baseline methods:

| Model / Method | Cell Type Clustering | Batch Integration | Biological Relevance | Key Strengths and Limitations |
|---|---|---|---|---|
| scGPT | Variable performance; outperforms baselines on some datasets (e.g., PBMC 12k) but underperforms on others [4] | Robust on complex datasets with biological batch effects; outperforms Harmony and scVI on the Immune and Tabula Sapiens datasets [4] | Captures meaningful biological insights into the relational structure of genes and cells [16] | Strength: strong across diverse tasks. Limitation: inconsistent zero-shot clustering performance [17] |
| Geneformer | Underperforms HVG, scVI, and Harmony across most datasets and metrics [4] | Consistently ranks last across batch integration metrics; embeddings often retain batch effects [4] | Benefits from effective pretraining strategies for gene-level tasks [17] | Strength: effective pretraining for gene-level tasks. Limitation: poor zero-shot batch integration and cell type clustering [4] |
| scFoundation | Not specifically evaluated for clustering | Not specifically evaluated for batch integration | Demonstrates strong capabilities in gene-level tasks [17] | Strength: gene-level task performance. Limitation: limited evaluation on cell-level tasks |
| scBERT | Limited zero-shot evaluation available | Limited zero-shot evaluation available | Lags behind larger models, likely due to smaller size and limited training data [17] | Strength: architecture design. Limitation: model scale constraints |
| HVG (baseline) | Outperforms Geneformer and scGPT across all metrics [4] | Achieves the best batch integration scores across all datasets [4] | Provides fundamental biological signal | Strength: simple, effective, computationally efficient. Limitation: limited capacity for complex pattern recognition |
| scVI (baseline) | Outperforms the proposed foundation models in cell type clustering [4] | Excellent technical batch-effect correction; challenged by biological variation in the Immune datasets [4] | Captures biologically meaningful variation | Strength: robust probabilistic modeling. Limitation: may overcorrect biological variation |
| Harmony (baseline) | Competitive performance with scFMs [4] | Effective technical integration; challenged by Tabula Sapiens complexity [4] | Preserves biological structure while removing technical artifacts | Strength: fast, efficient integration. Limitation: struggles with highly diverse datasets |

Interpretability Frameworks and Methods

Model interpretability is essential for debugging, establishing trust, and deriving biological insights from scFMs. Various techniques can be applied to understand model decisions and the biological relevance of learned representations.

Interpretability Techniques for Foundation Models

The following table outlines key interpretability methods applicable to scFMs:

| Interpretability Technique | Mechanism | Applicable Tasks | Biological Insights Generated |
|---|---|---|---|
| SHAP (SHapley Additive exPlanations) | Computes feature importance by measuring marginal contribution across feature combinations [65] | Cell type prediction, gene expression prediction, drug response | Identifies genes most influential to specific model predictions [65] |
| Attention mechanism analysis | Analyzes patterns in transformer self-attention weights to identify gene-gene relationships [1] | Gene regulatory network inference, cell state transitions | Reveals potential regulatory relationships and coordinated gene expression patterns [1] |
| Embedding dimensionality reduction | Projects high-dimensional cell embeddings to 2D/3D space using UMAP or t-SNE [4] | Cell type clustering, batch integration assessment | Visualizes cellular heterogeneity and model representation quality [4] |
| Global surrogate models | Trains interpretable models to approximate complex foundation model predictions [65] | Model debugging, feature importance analysis | Provides simplified, interpretable approximations of complex model behavior [65] |
| scGraph-OntoRWR (novel metric) | Measures consistency between cell type relationships in embeddings and prior biological knowledge [16] | Evaluation of biological relevance in embeddings | Quantifies how well the model captures established biological hierarchies [16] |
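Of the techniques above, the global-surrogate approach is the simplest to prototype. The sketch below trains a shallow decision tree to mimic a "black-box" classifier's predictions; both models and the data are synthetic stand-ins for an scFM-based predictor, and the fidelity check is illustrative rather than a prescribed threshold.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Global-surrogate sketch: approximate a complex model with an
# interpretable one. Data and labels here are synthetic.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))                    # stand-in cell embeddings
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # synthetic cell-type labels

black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
bb_pred = black_box.predict(X)

# Key point: the surrogate is trained on the black box's *predictions*,
# not on the ground-truth labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, bb_pred)
fidelity = accuracy_score(bb_pred, surrogate.predict(X))
print(f"surrogate fidelity to black box: {fidelity:.2f}")
```

A high-fidelity surrogate can then be inspected directly (e.g., its split features indicate which genes drive the black-box decisions); a low-fidelity one signals that the simple model cannot capture the foundation model's behavior.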

Interpreting Model Limitations

Interpretability analyses reveal why scFMs may underperform in zero-shot settings. For example, analysis of Geneformer's embeddings shows they often fail to retain sufficient cell type information, with clustering primarily driven by batch effects rather than biological signals [4]. Similarly, investigating attention patterns can reveal whether models focus on biologically plausible gene relationships or spurious technical correlations.

The Lowest Common Ancestor Distance (LCAD) metric provides a biologically-grounded approach to evaluating cell type annotation errors by measuring the ontological proximity between misclassified cell types, with smaller distances indicating more biologically reasonable errors [16].
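The idea behind LCAD can be sketched on a toy ontology fragment: represent the hierarchy as a child-to-parent map and count the edges between two cell types through their lowest common ancestor. The mini-ontology and the exact distance definition here are illustrative assumptions; the cited metric [16] may be formulated differently.

```python
# Hypothetical mini-ontology: child -> parent.
PARENT = {
    "CD4 T cell": "T cell",
    "CD8 T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "leukocyte",
    "monocyte": "leukocyte",
    "leukocyte": "cell",
}

def ancestors(node):
    """Path from a node up to the ontology root (inclusive)."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lcad(a, b):
    """Number of edges from a to b through their lowest common ancestor."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    common = set(anc_a) & set(anc_b)
    lca = next(n for n in anc_a if n in common)   # first shared node going up
    return anc_a.index(lca) + anc_b.index(lca)

# Confusing CD4 with CD8 T cells (distance 2) is a more biologically
# reasonable error than confusing CD4 T cells with monocytes (distance 4).
print(lcad("CD4 T cell", "CD8 T cell"))
print(lcad("CD4 T cell", "monocyte"))
```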

Computational Resource Requirements

The scale of scFMs creates significant computational demands throughout the model lifecycle, from pretraining to deployment. Understanding these requirements is essential for practical implementation.

Computational Cost Analysis

| Model | Parameter Count | Pretraining Dataset Size | Inference Memory Requirements | Fine-tuning Efficiency |
|---|---|---|---|---|
| scGPT | ~50 million [16] | 33 million non-cancerous human cells [16] | High (512-dimensional embeddings) [16] | Parameter-efficient methods available (adapters, prefix tuning) [31] |
| Geneformer | ~40 million [16] | 30 million cells [16] | Moderate (256–512-dimensional embeddings) [16] | Requires full fine-tuning in the standard approach |
| UCE | ~650 million [16] | 36 million cells [16] | Very high (1280-dimensional embeddings) [16] | Limited information on efficient fine-tuning |
| scFoundation | ~100 million [16] | 50 million cells [16] | High (3072-dimensional embeddings) [16] | Architecture supports various fine-tuning approaches |
| scBERT | ~6 million [1] | 1.12 million human cells [31] | Lower than larger models | Less computationally intensive fine-tuning |

Efficient Fine-Tuning Strategies

Recent advances in parameter-efficient fine-tuning enable adaptation of scFMs with minimal computational overhead:

  • Adapter-based Approaches: Insert small trainable layers within transformer blocks, training less than 1% of original parameters while maintaining performance [31]
  • Prefix Tuning: Prepends trainable tensors to each transformer block, achieving comparable results to full fine-tuning with 0.1% of parameters [31]
  • Drug-Conditional Adapters: Enable conditioning on unseen modalities (e.g., molecular structures) while preserving biological knowledge from pretraining [31]
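The parameter savings of the adapter approach are easy to see in a minimal numpy sketch of a bottleneck adapter: a small down-project/up-project residual block inserted into a frozen transformer layer. The dimensions below are illustrative assumptions, not taken from any specific scFM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 512, 16       # hypothetical hidden and adapter sizes

W_down = rng.normal(0, 0.02, (d_model, d_bottleneck))
W_up = np.zeros((d_bottleneck, d_model))  # zero-init: adapter starts as identity

def adapter(h):
    """Residual bottleneck adapter applied to hidden states h of shape (n, d_model)."""
    return h + np.maximum(h @ W_down, 0.0) @ W_up

h = rng.normal(size=(4, d_model))
out = adapter(h)

# With W_up zero-initialized the adapter is initially a no-op, so inserting
# it does not perturb the pretrained model's behavior.
assert np.allclose(out, h)

# Trainable adapter parameters vs. one frozen feed-forward layer (d -> 4d -> d):
adapter_params = W_down.size + W_up.size   # 2 * 512 * 16 = 16,384
ffn_params = 2 * d_model * 4 * d_model     # 2,097,152
print(f"adapter/ffn parameter ratio: {adapter_params / ffn_params:.3%}")
```

Only `W_down` and `W_up` would be trained, which for these sizes is well under 1% of the feed-forward parameters alone, consistent with the figures reported for adapter-based approaches [31].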

Experimental Protocols for Trade-off Evaluation

Standardized protocols enable consistent evaluation of the trade-offs between performance, interpretability, and computational cost.

Protocol 1: Zero-Shot Cell Type Clustering Assessment

Purpose: Evaluate model performance in discriminating cell types without task-specific training.

Materials:

  • Pretrained model (scGPT, Geneformer, or alternatives)
  • Benchmark dataset with ground truth cell labels (e.g., Tabula Sapiens, Pancreas)
  • Baseline methods (HVG, scVI, Harmony)
  • Computing environment with adequate GPU memory

Procedure:

  • Data Preprocessing:
    • Load target dataset and apply standard normalization
    • Generate cell embeddings using foundation model's zero-shot capability
    • Apply same preprocessing for baseline methods
  • Embedding Generation:

    • For transformer models: forward pass through pretrained network without fine-tuning
    • Extract cell embeddings from model's representation layer
    • Reduce dimensionality using PCA (50 components) for baseline comparisons
  • Clustering and Evaluation:

    • Apply Leiden clustering to embeddings across all methods
    • Calculate AvgBIO and ASW scores against ground truth labels
    • Compare results across methods and datasets
  • Interpretability Analysis:

    • Apply UMAP visualization to embeddings from each method
    • Use SHAP analysis to identify genes driving cluster formation
    • Calculate scGraph-OntoRWR score to assess biological consistency

Interpretation: Models with higher AvgBIO/ASW scores and scGraph-OntoRWR values provide better separation of biologically meaningful cell types. Superior performance of simple baselines may indicate limitations in foundation model pretraining.
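The clustering-and-evaluation step of this protocol can be sketched on synthetic embeddings. In practice the embeddings would come from a pretrained scFM (or PCA on HVGs) and clustering would use Leiden (e.g., via scanpy); here synthetic data and k-means stand in to keep the example dependency-free, and the cluster count and dimensions are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(42)
n_types, n_per, dim = 3, 100, 50

# Synthetic "cell embeddings": three well-separated cell types.
centers = rng.normal(0, 5, (n_types, dim))
emb = np.vstack([c + rng.normal(0, 1, (n_per, dim)) for c in centers])
truth = np.repeat(np.arange(n_types), n_per)

# Cluster the embeddings (Leiden would be used in practice).
labels = KMeans(n_clusters=n_types, n_init=10, random_state=0).fit_predict(emb)

# ASW against ground-truth labels measures how well the embedding separates
# known cell types; ARI measures agreement between clustering and truth.
asw = silhouette_score(emb, truth)
ari = adjusted_rand_score(truth, labels)
print(f"ASW (ground truth): {asw:.2f}, ARI: {ari:.2f}")
```

Running the same evaluation on embeddings from each method (scFM, HVG+PCA, scVI, Harmony) with identical clustering settings makes the scores directly comparable.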

Protocol 2: Batch Integration Capability Assessment

Purpose: Evaluate model ability to remove technical artifacts while preserving biological variation.

Materials:

  • Dataset with known batch effects (e.g., Pancreas dataset with 5 sources)
  • Pretrained foundation models and baseline methods
  • Evaluation metrics (PCR, batch mixing scores)

Procedure:

  • Embedding Generation:
    • Generate cell embeddings using foundation models and baseline methods
    • Ensure consistent gene space alignment across datasets
  • Quantitative Evaluation:

    • Calculate PCR score measuring proportion of variance explained by batch
    • Compute batch mixing scores assessing neighborhood purity
    • Compare metrics across methods
  • Biological Preservation Assessment:

    • Assess whether integrated embeddings maintain separation of known biological groups
    • Compare cell type clustering performance before and after integration

Interpretation: Effective batch correction shows low PCR scores (effective batch removal) while maintaining biological structure. Models that over-correct by removing biological variation should be identified and potentially avoided.
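The PCR calculation in this protocol can be sketched as the variance-weighted R² of regressing each principal component on the batch covariate. This follows the common scIB-style formulation; the benchmark's exact definition may differ, and the data below are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pcr_score(emb, batch, n_pcs=10):
    """Variance-weighted R^2 of batch regressed onto principal components."""
    pca = PCA(n_components=n_pcs).fit(emb)
    pcs = pca.transform(emb)
    onehot = np.eye(batch.max() + 1)[batch]   # one-hot batch design matrix
    r2 = np.array([
        LinearRegression().fit(onehot, pcs[:, i]).score(onehot, pcs[:, i])
        for i in range(n_pcs)
    ])
    w = pca.explained_variance_ratio_
    return float((w * r2).sum() / w.sum())    # in [0, 1]; lower = less batch signal

rng = np.random.default_rng(0)
n, dim = 200, 30
batch = np.repeat([0, 1], n // 2)
base = rng.normal(size=(n, dim))              # no batch effect
shifted = base + 3.0 * batch[:, None]         # strong additive batch effect

s_base = pcr_score(base, batch)
s_shift = pcr_score(shifted, batch)
print(f"no batch effect: {s_base:.3f}, strong batch effect: {s_shift:.3f}")
```

A well-integrated embedding should score close to the batch-free baseline, but the biological-preservation check in step 3 is still required to catch over-correction.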

Protocol 3: Computational Efficiency Benchmarking

Purpose: Quantify computational resources required for training and inference.

Materials:

  • Standard benchmarking environment (specified GPU, CPU, memory)
  • Model implementations with consistent deep learning framework
  • Timing and memory profiling tools

Procedure:

  • Inference Speed Assessment:
    • Measure time to generate embeddings for standardized dataset (e.g., 10,000 cells)
    • Profile GPU memory usage during inference
    • Compare across model architectures
  • Fine-tuning Efficiency:

    • Implement parameter-efficient fine-tuning methods (adapters, prefix tuning)
    • Measure training time and memory requirements compared to full fine-tuning
    • Assess performance retention with reduced parameter updates
  • Scaling Analysis:

    • Evaluate how inference time scales with increasing dataset size
    • Measure memory requirements for different batch sizes

Interpretation: Models with favorable performance-compute trade-offs enable broader application, particularly in resource-constrained environments. Performance gains of large models must be justified by their computational costs.
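The inference-speed step of this protocol can be sketched with a dummy model. A random projection stands in for a real foundation model's forward pass; in practice you would time the model's own embedding call on GPU and profile memory with the framework's tools (for PyTorch, `torch.cuda.max_memory_allocated`). Sizes here are illustrative.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n_genes, d_embed = 2000, 512
W = rng.normal(size=(n_genes, d_embed)).astype(np.float32)

def embed(cells):
    """Dummy 'model': project expression profiles to an embedding space."""
    return cells @ W

# Time embedding generation at two dataset sizes to observe scaling.
for n_cells in (1_000, 10_000):
    X = rng.normal(size=(n_cells, n_genes)).astype(np.float32)
    t0 = time.perf_counter()
    emb = embed(X)
    dt = time.perf_counter() - t0
    print(f"{n_cells:>6} cells: {dt * 1e3:.1f} ms, "
          f"embedding size {emb.nbytes / 1e6:.1f} MB")
```

Repeating this loop for each model under identical hardware and batch settings yields the comparable timing and memory figures the protocol calls for.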

Integrated Decision Framework

Selecting the appropriate scFM requires balancing multiple factors based on specific research goals and constraints. The following workflow visualizes the decision process:

  • What is the primary task? Cell-level (clustering, annotation), gene-level (networks, function), or perturbation prediction. For gene-level tasks, Geneformer is recommended for its strong gene-level task performance.
  • Is zero-shot use required (no labels available)?
    • Yes: assess dataset size and diversity. For large, diverse datasets, scGPT is recommended for its strong zero-shot performance across diverse tasks; for small, specific datasets, prefer baseline methods (HVG, scVI, Harmony) with simpler ML models.
    • No: assess computational resources. With high compute, scGPT with parameter-efficient fine-tuning is recommended; with limited compute, prefer the baseline methods.

Essential Research Reagents and Computational Tools

Successful implementation of scFMs requires both computational tools and biological resources. The following table details key components of the research toolkit:

| Tool/Resource | Type | Function | Application Context |
|---|---|---|---|
| BioLLM Framework | Software tool | Unified interface for integrating and evaluating diverse scFMs [17] | Standardized benchmarking and model switching across different architectures |
| CELLxGENE Dataset | Data resource | Curated single-cell datasets with standardized annotations [1] | Pretraining and evaluation of foundation models |
| SHAP (SHapley Additive exPlanations) | Interpretability library | Explains model predictions by quantifying feature importance [65] | Identifying genes driving model decisions and detecting potential biases |
| Parameter-efficient fine-tuning methods | Algorithmic approach | Adapters and prefix tuning for model adaptation with minimal parameters [31] | Adapting foundation models to new tasks with limited data and compute |
| scGraph-OntoRWR | Evaluation metric | Quantifies consistency between embedding relationships and biological knowledge [16] | Assessing biological relevance of learned representations |
| WebAIM Contrast Checker | Accessibility tool | Verifies color contrast ratios for data visualizations [66] | Creating accessible figures that meet WCAG guidelines |

Interpreting the trade-offs between performance, interpretability, and computational cost in single-cell foundation models requires a multifaceted approach. Current evidence suggests that while scFMs show promise in capturing complex biological relationships, their zero-shot performance does not consistently surpass simpler methods across all tasks. Researchers should select models based on specific task requirements, dataset characteristics, and computational constraints, using the structured evaluation framework presented here. As the field evolves, continued benchmarking and development of interpretability methods will be essential for realizing the full potential of foundation models in biological discovery and therapeutic development.

Conclusion

The current generation of single-cell foundation models represents a promising yet maturing technology. While they offer the potential for versatile, generalizable biological insights and have demonstrated success in specific applications like efficient fine-tuning for drug response prediction, rigorous zero-shot evaluations reveal they do not consistently outperform established, simpler methods on core tasks like cell type clustering and batch integration. Their true value appears to be task-dependent, excelling where their learned representations of biological relationships can be leveraged. Future progress hinges on developing more biologically meaningful pretraining objectives, creating standardized and rigorous evaluation frameworks that prioritize zero-shot capability, and improving model interpretability. For researchers and clinicians, this means a pragmatic approach is essential: scFMs are powerful new tools for the arsenal, but their application should be guided by specific task requirements and validated against traditional baselines. Their continued evolution holds the key to unlocking deeper insights into cellular function, disease mechanisms, and accelerating personalized therapeutic discovery.

References