scGPT vs Geneformer: A Critical Benchmarking Review for Biomedical Researchers

Nora Murphy Nov 27, 2025 152

This article provides a comprehensive, evidence-based comparison of two prominent single-cell foundation models, scGPT and Geneformer, tailored for researchers and drug development professionals.

scGPT vs Geneformer: A Critical Benchmarking Review for Biomedical Researchers

Abstract

This article provides a comprehensive, evidence-based comparison of two prominent single-cell foundation models, scGPT and Geneformer, tailored for researchers and drug development professionals. We synthesize recent benchmarking studies to evaluate their performance across key tasks like cell type annotation, batch integration, and perturbation prediction. The analysis covers foundational principles, practical applications, optimization strategies, and rigorous validation, revealing that while both models show promise, their zero-shot performance often lags behind simpler methods. This review offers actionable insights for model selection and discusses the future trajectory of foundation models in clinical and biomedical research.

Understanding scGPT and Geneformer: Architectures and Pretraining Paradigms

The advent of single-cell RNA sequencing (scRNA-seq) has revolutionized biological research by enabling the investigation of cellular heterogeneity, developmental trajectories, and disease mechanisms at unprecedented resolution. However, the high dimensionality, sparsity, and technical noise inherent to scRNA-seq data present significant analytical challenges. Inspired by the remarkable success of transformer architectures in natural language processing, computational biologists have developed specialized foundation models to harness the vast amounts of emerging single-cell data. These models, pretrained on millions of cells, promise to learn universal biological representations that can be adapted to diverse downstream tasks with minimal fine-tuning. Among the most prominent architectures in this rapidly evolving field are scGPT and Geneformer, which embody contrasting philosophical approaches to modeling transcriptomic data. This article provides a comprehensive comparison of these two pioneering models, examining their core architectures, pretraining strategies, and performance across key biological tasks to guide researchers in selecting appropriate tools for their specific analytical needs.

Architectural Philosophies: A Technical Comparison

scGPT: Generative Pretrained Transformer for Single-Cell Biology

scGPT adopts a decoder-only transformer architecture similar to the GPT series, treating single-cell transcriptomes as sequences of gene-expression pairs. The model processes input data by creating three distinct embeddings for each gene: a gene identity embedding, an expression value embedding (often using value binning), and no positional embedding, reflecting the assumption that gene interactions are non-sequential and permutation-invariant. scGPT employs a masked language modeling pretraining objective where randomly selected genes are masked, and the model learns to reconstruct their expression values based on the context provided by other genes. This approach allows scGPT to learn the complex, context-dependent relationships between genes across diverse cell types and tissues. With 50 million parameters pretrained on approximately 33 million human cells, scGPT aims to build a comprehensive foundation model capable of generalizing across multiple omics modalities, including scRNA-seq, scATAC-seq, and spatial transcriptomics [1] [2].

Geneformer: Context-Aware Encoder Architecture

In contrast, Geneformer utilizes a transformer encoder architecture similar to BERT, with a distinctive rank-based input representation. Rather than using raw expression values, Geneformer employs "rank value encoding," where genes are sorted by expression level to create a cell-specific "sentence" of genes. This approach prioritizes the relative importance of genes within each cell while reducing technical variability. Geneformer incorporates both gene identity embeddings and positional embeddings, with the latter reflecting the ranked order of genes. Its pretraining utilizes a masked language modeling objective with a key distinction: instead of predicting continuous expression values, it predicts the identities of masked genes based on their context. With 40 million parameters pretrained on 30 million human cells, Geneformer is designed to capture gene-gene relationships and hierarchical regulatory networks, with a particular emphasis on context-aware representations that can illuminate biological mechanisms [1] [3].

Table 1: Core Architectural Comparison of scGPT and Geneformer

Architectural Feature	scGPT	Geneformer
Transformer Type	Decoder-only	Encoder-only
Primary Input Representation	Gene + value embeddings	Rank-based gene ordering
Value Embedding	Value binning	Ordering (implicit)
Positional Embedding	×	✓
Pretraining Dataset Size	~33 million cells	~30 million cells
Model Parameters	50 million	40 million
Pretraining Objective	Masked gene modeling with MSE loss	Masked gene modeling with CE loss
Gene Tokenization	1200 HVGs	2048 ranked genes

Performance Benchmarking: Quantitative Comparisons

Zero-Shot Capability Assessment

Zero-shot performance, where models are applied without task-specific fine-tuning, is crucial for exploratory biological research where labeled data may be unavailable. Recent evaluations reveal significant limitations in both models' zero-shot capabilities. In cell type clustering tasks measured by Average BIO (AvgBIO) score, both scGPT and Geneformer underperformed compared to simpler methods like highly variable genes (HVG) selection and established algorithms such as Harmony and scVI. Geneformer demonstrated particularly high variance across different datasets, while scGPT showed more consistent but still suboptimal performance. In batch integration tasks, which aim to remove technical artifacts while preserving biological signals, both models struggled to correct for batch effects, with Geneformer consistently ranking last across most evaluation metrics. Surprisingly, selecting HVGs alone often outperformed both transformer-based approaches in batch integration scores calculated in full dimensions [4] [5].

Task-Specific Performance Variations

Comprehensive benchmarking across diverse biological applications reveals a complex performance landscape where neither model consistently outperforms the other. Instead, each demonstrates strengths in specific domains. scGPT generally excels in perturbation prediction and multi-omic integration, leveraging its generative architecture to model cellular responses to genetic and chemical perturbations. Geneformer typically shows advantages in cell type annotation and in silico perturbation experiments, where its rank-based input representation appears to capture biologically meaningful hierarchies. However, benchmarking studies consistently note that performance is highly dependent on dataset characteristics and task requirements, with neither model establishing clear overall superiority [1] [6].

Table 2: Performance Comparison Across Key Biological Tasks

Task Category	Superior Model	Key Performance Notes	Primary Metric
Zero-shot Cell Type Clustering	HVG (baseline)	Both models underperformed vs. simpler methods	AvgBIO Score
Batch Integration	scVI/Harmony (baseline)	Geneformer consistently ranked last	iLISI, PCR
Cell Type Annotation	Geneformer	Better captures cell-type hierarchies	Accuracy
Perturbation Prediction	scGPT	Superior response modeling	MSE
Cross-Species Generalization	Geneformer	Mouse-Geneformer validated cross-species	Accuracy
Multi-omic Integration	scGPT	Handles diverse modalities	Integration Score

Experimental Protocols and Evaluation Methodologies

Standardized Evaluation Frameworks

Rigorous benchmarking of single-cell foundation models requires standardized evaluation protocols across diverse datasets and tasks. The most comprehensive evaluations employ multiple datasets representing different tissues, technologies, and biological conditions. For cell type clustering assessment, models generate cell embeddings which are evaluated using metrics like Average Silhouette Width (ASW) and Average BIO (AvgBIO) score, which measure the separation and purity of known cell types in the latent space. Batch integration performance is quantified using metrics such as Integration Local Inverse Simpson's Index (iLISI) for batch mixing and principal component regression (PCR) score for biological conservation. For perturbation tasks, models are evaluated on their ability to predict expression changes after genetic or chemical perturbations, typically measured using mean squared error (MSE) or correlation coefficients between predicted and observed expression changes [4] [1] [6].

Zero-Shot Evaluation Protocol

The zero-shot evaluation protocol is particularly important for assessing the fundamental biological knowledge captured during pretraining. In this setting, models generate embeddings without any task-specific fine-tuning, and these embeddings are directly used for downstream analyses. This approach tests the model's ability to extract biologically meaningful representations without additional training, which is especially valuable for exploratory research where labeled data may be unavailable or incomplete. Studies implementing this protocol have revealed significant limitations in current foundation models, demonstrating that their pretraining objectives do not necessarily translate to high-quality representations for all downstream tasks [4].

Technical Implementation: Research Reagent Solutions

Table 3: Essential Research Reagents for Single-Cell Foundation Model Experiments

Research Reagent	Function/Purpose	Implementation Example
CELLxGENE Datasets	Curated single-cell data for pretraining and benchmarking	33M human cells for scGPT pretraining
Highly Variable Genes (HVG)	Feature selection to reduce dimensionality	1200 HVGs for scGPT input
Rank Value Encoding	Input representation method for Geneformer	2048 ranked genes per cell
Masked Language Modeling	Self-supervised pretraining objective	Randomly mask 15% of genes
Harmony	Batch integration benchmark algorithm	Compare against foundation models
scVI	Variational autoencoder benchmark	Baseline for clustering and integration
Perturb-Seq Data	Genetic perturbation datasets for evaluation	Evaluate perturbation prediction accuracy

Architectural Visualizations

Model Architecture Comparison

Evaluation Workflow Diagram

The comparative analysis of scGPT and Geneformer reveals that neither model establishes universal superiority across all tasks and applications. Instead, each exhibits distinct strengths aligned with their architectural philosophies. scGPT's generative, value-based approach demonstrates advantages in perturbation modeling and multi-omic integration, while Geneformer's rank-based, context-aware encoder architecture shows stronger performance in cell type annotation and hierarchical biological reasoning. Critically, both models face reliability challenges in zero-shot settings, where simpler methods like HVG selection or traditional algorithms like scVI and Harmony can sometimes outperform these complex foundation models. This suggests that while the transformer architecture provides substantial modeling power, the pretraining objectives and strategies for single-cell data require further refinement to consistently extract biologically meaningful representations.

For researchers and drug development professionals, model selection should be guided by specific analytical needs rather than presumed general capability. scGPT may be preferable for studies focusing on cellular responses to perturbations or integrating multimodal data, while Geneformer might better serve projects requiring fine-grained cell type discrimination or exploration of gene regulatory hierarchies. Future developments in this rapidly evolving field will likely address current limitations through improved pretraining strategies, more biologically informed architectures, and enhanced evaluation frameworks that better capture performance in real-world research scenarios. As both approaches continue to mature, they hold tremendous promise for advancing our understanding of cellular biology and accelerating therapeutic discovery.

In the analysis of single-cell RNA sequencing (scRNA-seq) data, foundation models like scGPT and Geneformer have emerged as powerful tools for decoding cellular heterogeneity. These models employ a critical preprocessing step called tokenization, which transforms raw gene expression data into a structured format that deep learning models can process. The choice of tokenization strategy fundamentally shapes how a model perceives and interprets biological information, influencing its performance across diverse tasks such as cell type annotation, batch integration, and perturbation prediction. scGPT utilizes a value binning approach, converting continuous expression values into discrete categories, whereas Geneformer adopts a gene ranking method, representing each cell by the relative ordering of gene expression levels. This guide provides a detailed, evidence-based comparison of these two strategies, examining their technical implementations, performance characteristics, and suitability for different research applications within the life sciences.

Technical Foundations of Tokenization Strategies

Value Binning in scGPT

The value binning strategy employed by scGPT is designed to convert continuous, high-dimensional gene expression data into a discrete, sequence-like format compatible with transformer architectures.

Process Overview: scGPT's tokenization begins by treating each gene as a distinct token, assigned a unique identifier. The raw count data from the cell-by-gene matrix undergoes normalization before the continuous expression values are discretized into a fixed number of bins [7]. This binning process transforms the inherently continuous measurement of gene expression into categorical values, effectively creating a vocabulary of expression levels.
Technical Implementation: The model uses an embedding size of 512 and processes data through 12 transformer blocks with 8 attention heads each [7]. A key technical aspect is its use of value binning to convert all expression counts into relative values, facilitating the model's ability to learn from the discretized expression spectrum [7]. During pretraining, scGPT employs an iterative masked gene modeling objective with mean squared error (MSE) loss, where certain genes are masked and the model must reconstruct their binned expression values [1].
Architectural Considerations: Unlike natural language processing where word order provides critical information, gene sequences lack inherent ordering. scGPT addresses this by omitting positional embeddings, relying instead on the attention mechanism to learn gene-gene relationships without presuming sequential dependencies [1].

Gene Ranking in Geneformer

Geneformer implements a rank-based tokenization strategy that emphasizes relative gene expression patterns over absolute values, focusing on the most biologically informative genes for distinguishing cell states.

Process Overview: Geneformer represents each cell's transcriptome as a rank value encoding where genes are sorted by their expression level in that specific cell, normalized by their median expression across the entire pretraining corpus [8]. This approach creates a nonparametric representation that prioritizes genes that best distinguish cell states, effectively deprioritizing ubiquitously highly-expressed housekeeping genes while promoting transcription factors and other regulatory elements that may be lowly expressed but highly informative [8].
Technical Implementation: The tokenization process requires raw counts scRNA-seq data with Ensembl IDs for genes and total read counts (n_counts) for cells [9] [10]. For the V2 model series, the input size is 4096 genes, with special tokens (CLS and EOS) added to the rank value encoding [10]. The model is pretrained using a masked learning objective where 15% of genes in each transcriptome are masked, and the model predicts which gene belongs in each masked position based on the contextual information from the remaining unmasked genes [8].
Biological Rationale: The ranking approach leverages the massive scale of the pretraining corpus (approximately 30 million cells for V1, 104 million for V2) to normalize gene expression across diverse cellular contexts [8]. This strategy is theoretically more robust to technical artifacts that systematically bias absolute transcript counts while preserving the relative ranking of genes within each cell [8].

Table 1: Technical Specifications of scGPT and Geneformer Tokenization Approaches

Feature	scGPT (Value Binning)	Geneformer (Gene Ranking)
Input Data	Raw count matrix [7]	Raw counts without feature selection [9]
Gene Identification	Gene tokens with unique identifiers [7]	Ensembl IDs [9]
Value Processing	Binning into discrete categories [7]	Ranking by expression level [8]
Normalization	Custom binning technique [7]	Median expression across pretraining corpus [8]
Model Input Size	1,200 highly variable genes [1]	4,096 genes (V2 series) [10]
Positional Encoding	Not used [1]	Used in encoder [1]
Pretraining Objective	Masked gene modeling with MSE loss [1]	Masked gene prediction with cross-entropy loss [8]

Performance Comparison in Downstream Tasks

Zero-Shot Evaluation Evidence

Recent rigorous evaluations of foundation models in zero-shot settings—where models are applied without task-specific fine-tuning—reveal critical insights into the real-world performance of these tokenization strategies.

Cell Type Clustering Performance: In comprehensive benchmarking, both scGPT and Geneformer demonstrated limitations in zero-shot cell type separation compared to established methods. When evaluated across multiple datasets, both models performed worse than selecting highly variable genes (HVG) and more established methods like Harmony and scVI in cell type clustering, as measured by average BIO (AvgBio) score [4] [11]. Notably, the simple approach of selecting HVGs outperformed both Geneformer and scGPT across all metrics [4] [11].
Batch Integration Capabilities: Batch integration—correcting for technical variations across datasets while preserving biological signals—poses significant challenges for both tokenization approaches. Evaluation of the Pancreas benchmark dataset revealed that while Geneformer and scGPT can integrate experiments using the same technique, they generally fail to correct for batch effects between different techniques [4] [11]. Geneformer's embeddings particularly struggled, with clustering primarily driven by batch effects rather than biological information [4] [11].
Contextual Performance Variability: Performance varies significantly based on dataset characteristics and match with pretraining data. scGPT showed better performance on the PBMC (12k) dataset compared to scVI, Harmony, and HVG, but underperformed on other datasets [4] [11]. Surprisingly, models did not consistently outperform baselines even on datasets that were included in their pretraining corpus, indicating an unclear relationship between pretraining objectives and downstream task performance [4] [11].

Table 2: Zero-Shot Performance Metrics Across Evaluation Studies

Task & Metric	scGPT	Geneformer	HVG Baseline	scVI Baseline
Cell Type Clustering (AvgBIO)	Variable (Best: PBMC)	Underperforms baselines	Outperforms both models	Outperforms both models
Batch Integration (Pancreas)	Partial success	Primarily batch-driven	Effective integration	Effective integration
PCR Score	Moderate	Consistently ranks last	Varies by dataset	Second best overall
ASW Metric	Comparable to scVI on some datasets	Underperforms baselines	Strong performance	Strong performance

Biological Insight Capture

Beyond technical metrics, the ability of tokenization strategies to capture meaningful biological relationships represents a crucial dimension for evaluation.

Gene Network Inference: Geneformer's ranking approach demonstrates particular strength in capturing gene-gene relationships and network hierarchy. During pretraining, Geneformer gains a fundamental understanding of network dynamics, encoding network hierarchy in the model's attention weights in a completely self-supervised manner [8]. This capability enabled the identification of a novel transcription factor in cardiomyocytes that was experimentally validated as critical to contractile force generation [8].
Perturbation Prediction: In predicting cellular responses to genetic and chemical perturbations, both tokenization approaches face challenges. In benchmarking against large perturbation models (LPM), both Geneformer and scGPT were outperformed across multiple experimental settings [12]. When used for perturbation prediction, both models were consistently and significantly outperformed by the specialized LPM approach, regardless of preprocessing methodology [12].
Knowledge Representation: Alternative approaches like GenePT suggest that combining textual gene information with expression data may enhance biological insight capture. GenePT utilizes ChatGPT embeddings of gene summaries from NCBI, achieving comparable or better performance than Geneformer and scGPT on many downstream tasks despite requiring no single-cell data curation or pretraining [13]. This indicates that textual gene representations effectively capture biological relationships relevant to single-cell analysis.

Experimental Protocols for Performance Evaluation

Standardized Benchmarking Methodology

To ensure fair comparison between tokenization strategies, researchers have developed standardized evaluation protocols that assess model performance across multiple biological tasks.

Dataset Selection and Preparation: Benchmarking studies employ diverse datasets representing different tissues, technologies, and biological conditions. Key datasets include Tabula Sapiens, Pancreas datasets with five different sources, PBMC (12k), and Immune datasets [4] [11]. These datasets are selected to represent both technical variation (different experimental protocols) and biological variation (different cell types, tissues, and donors). Standard preprocessing includes quality control, normalization, and filtering using tools like Scanpy [13].
Evaluation Metrics: Multiple complementary metrics provide a comprehensive performance assessment:
- Cell Type Separation: Average BIO (AvgBio) score and average silhouette width (ASW) evaluate how well embeddings separate known cell types [4] [11].
- Batch Integration: Batch integration scores and principal component regression (PCR) quantify the removal of technical artifacts while preserving biological variation [4] [11].
- Biological Consistency: Novel metrics like scGraph-OntoRWR measure consistency of captured cell type relationships with prior biological knowledge from cell ontologies [1].
Experimental Controls: Studies include multiple baselines for comparison, including simple methods (highly variable genes), established algorithms (Harmony, scVI), and ablations of the foundation models themselves. For scGPT, variants include randomly initialized models and models pretrained on different tissue-specific subsets to disentangle the effects of pretraining data size versus composition [4] [11].

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Research Tools for Tokenization Strategy Evaluation

Tool/Resource	Function	Relevance to Tokenization Comparison
CELLxGENE Census	Large-scale single-cell data repository	Provides standardized pretraining data for scGPT [7]
Genecorpus-30M/104M	Curated single-cell transcriptome collection	Pretraining corpus for Geneformer [8]
Scanpy	Single-cell analysis in Python	Standardized data preprocessing pipeline [13]
Harmony	Batch effect correction algorithm	Performance baseline for integration tasks [4] [11]
scVI	Probabilistic modeling of scRNA-seq	Generative model baseline for comparison [4] [11]
HVG Selection	Feature selection method	Simple baseline for cell type separation [4] [11]
NCBI Gene Database	Gene summary information	Source for text-based embeddings in GenePT [13]

The comparative analysis of value binning (scGPT) and gene ranking (Geneformer) tokenization strategies reveals a complex performance landscape where neither approach consistently dominates across all tasks and contexts. The gene ranking method employed by Geneformer demonstrates particular strength in capturing gene network hierarchies and biological relationships, making it well-suited for discovery tasks focused on understanding regulatory mechanisms and identifying key drivers of cell state changes. Conversely, the value binning approach of scGPT offers advantages in certain integration tasks and provides a more direct representation of expression levels that may benefit quantitative prediction tasks.

Current evidence suggests that foundation models with both tokenization strategies underperform simpler methods in zero-shot settings for basic tasks like cell type clustering and batch integration [4] [11]. This indicates that the biological understanding captured during pretraining does not necessarily translate to robust out-of-the-box performance for standard analytical tasks. However, both models show value in more specialized applications, particularly when fine-tuned with task-specific data.

For researchers selecting between these approaches, considerations should include:

Task Objectives: Gene ranking may better support network biology and mechanistic insights, while value binning may suit quantitative prediction tasks.
Data Characteristics: Gene ranking's robustness to technical artifacts benefits heterogeneous data integration, while value binning preserves more quantitative information.
Computational Resources: scGPT's focused vocabulary (1,200 HVGs) reduces computational requirements compared to Geneformer's broader gene representation.
Biological Interpretability: Geneformer's attention weights directly encode network hierarchy, potentially offering more transparent biological insights.

Future development will likely benefit from hybrid approaches that combine the strengths of both tokenization strategies, potentially incorporating external biological knowledge from textual sources to enhance model performance and biological relevance.

The emergence of single-cell RNA sequencing (scRNA-seq) has generated vast amounts of transcriptomic data, creating an unprecedented opportunity for applying deep learning models to decipher cellular language. Inspired by breakthroughs in natural language processing (NLP), researchers have developed foundation models pretrained on millions of single-cell transcriptomes using masked language modeling (MLM) objectives. Among these, scGPT and Geneformer represent two prominent architectures with distinct approaches to tokenization, model structure, and pretraining strategies. This guide provides an objective comparison of their performance across key biological tasks, supported by experimental data and standardized evaluation protocols, to inform researchers and drug development professionals selecting appropriate models for their specific applications.

Core Architectural Frameworks and Pretraining Approaches

Fundamental Model Architectures

scGPT and Geneformer both utilize transformer architectures but differ significantly in their implementation details and pretraining methodologies. The table below summarizes their core architectural characteristics:

Table 1: Architectural Comparison of scGPT and Geneformer

Feature	scGPT	Geneformer
Model Type	Decoder-style transformer	Encoder-style transformer
Parameters	~50 million	~40 million (6-layer)
Pretraining Data	33 million human cells [4] [14]	30 million human cells [4] [14]
Tokenization	Value binning of 1200 highly variable genes [1]	Ranking of 2048 genes by expression [1]
Value Representation	Discrete expression bins [14]	Relative gene ranking [14]
Positional Embedding	Not used [1]	Used [1]
Pretraining Task	Iterative MLM with MSE loss [1]	MLM with gene ID prediction [1]

Pretraining Workflows Visualized

The following diagram illustrates the core pretraining workflows for both models, highlighting their methodological differences:

Performance Comparison Across Key Biological Tasks

Zero-Shot Cell Type Clustering and Annotation

Zero-shot performance is critical for biological discovery where labeled data is scarce. Recent evaluations reveal significant limitations in both models when used without fine-tuning:

Table 2: Zero-Shot Cell Type Clustering Performance (AvgBIO Score)

Dataset	scGPT	Geneformer	HVG Baseline	scVI Baseline	Harmony Baseline
PBMC (12k)	0.62	0.45	0.58	0.59	0.55
Tabula Sapiens	0.51	0.38	0.56	0.54	0.49
Pancreas	0.48	0.41	0.55	0.53	0.52
Immune	0.53	0.43	0.57	0.56	0.54

Data source: Genome Biology evaluation [4] - Higher scores indicate better performance.

Both models underperform compared to simpler methods like Highly Variable Genes (HVG) selection and established algorithms like scVI and Harmony across most datasets [4] [15]. scGPT shows relatively better performance on the PBMC dataset, while Geneformer consistently ranks lowest across evaluation metrics.

Batch Integration Capabilities

Batch effect correction is essential for integrating datasets from different sources. The performance varies significantly between technical and biological batch effects:

Table 3: Batch Integration Performance (Batch Mixing Score)

Dataset	Batch Type	scGPT	Geneformer	HVG Baseline	scVI Baseline
Pancreas	Technical	0.52	0.38	0.61	0.59
PBMC	Technical	0.55	0.41	0.63	0.61
Tabula Sapiens	Biological	0.58	0.45	0.59	0.55
Immune	Biological	0.57	0.43	0.60	0.53

Data source: Genome Biology evaluation [4] - Higher scores indicate better batch mixing.

Qualitative assessment reveals that while Geneformer's embeddings primarily separate by batch effects with minimal cell type information, scGPT provides some cell type separation but still exhibits batch-driven clustering [4]. Both models struggle with technical batch effects between different experimental techniques.

Performance on Specialized Tasks

Beyond standard evaluations, both models show distinct strengths in specialized applications:

Table 4: Performance on Specialized Biological Tasks

Task	scGPT	Geneformer	Evaluation Context
Gene Network Inference	Moderate	Moderate	scPRINT outperforms both [16]
Drug Response Prediction	Strong	Moderate	Comprehensive benchmark [1]
Cell Type Annotation (Fine-tuned)	Strong	Strong	BioLLM framework evaluation [17]
Perturbation Prediction	Strong	Moderate	Multi-task benchmark [1]

Notably, a comprehensive benchmark evaluating six foundation models against established baselines found that no single scFM consistently outperforms others across all tasks, emphasizing the need for task-specific model selection [1].

Experimental Protocols and Evaluation Methodologies

Standardized Evaluation Workflow

To ensure fair comparison, researchers have developed standardized evaluation protocols:

Key Evaluation Metrics Explained

AvgBIO Score: Measures balanced integration of batch effects with biological conservation, higher values indicate better performance [4]
ASW (Average Silhouette Width): Evaluates clustering quality, values range from -1 (poor) to 1 (strong) [4]
Batch Mixing Score: Quantifies how well batches are integrated, higher values indicate better mixing [4]
PCR (Principal Component Regression): Measures proportion of variance explained by batch effects, lower values indicate better batch correction [4]
scGraph-OntoRWR: Novel metric measuring consistency of cell type relationships with prior biological knowledge [1]
LCAD (Lowest Common Ancestor Distance): Measures ontological proximity between misclassified cell types, assessing severity of annotation errors [1]

Essential Research Reagent Solutions

The following table details key computational tools and resources essential for reproducing foundation model comparisons:

Table 5: Essential Research Reagents for scFM Evaluation

Resource	Type	Function	Availability
CELLxGENE Census	Data Resource	Standardized single-cell datasets for training and evaluation [18]	Public
BioLLM Framework	Software Tool	Unified interface for diverse single-cell foundation models [17]	Open Source
scGraph-OntoRWR	Evaluation Metric	Novel ontology-informed metric for biological relevance [1]	Custom Implementation
Harmony	Baseline Algorithm	Batch integration baseline for performance comparison [4]	Open Source
scVI	Baseline Algorithm	Probabilistic modeling baseline for performance comparison [4]	Open Source
BenGRN Benchmark	Evaluation Suite	Specialized benchmark for gene network inference [16]	Open Source

The comparative analysis reveals that neither scGPT nor Geneformer consistently outperforms simpler baseline methods in zero-shot settings, challenging the assumption that larger pretrained models automatically provide superior biological insights [4] [15]. However, both models show value in specific applications: scGPT demonstrates robust performance across multiple tasks including drug response prediction, while Geneformer's rank-based approach provides distinctive embeddings for certain gene-level tasks [1] [17].

For researchers and drug development professionals, selection should be guided by specific use cases: scGPT may be preferable for multi-task applications requiring flexible fine-tuning, while established baselines like HVG selection or scVI remain competitive for standard clustering and batch correction tasks. Future development should focus on improving zero-shot capabilities through better pretraining objectives and incorporating biological prior knowledge to move beyond pattern recognition toward genuine biological understanding.

Defining 'Zero-Shot' Performance and Its Critical Role in Biological Discovery

In the evolving field of single-cell biology, foundation models like scGPT and Geneformer represent a transformative approach to analyzing cellular data. These models are pretrained on massive datasets comprising millions of single-cell gene expression profiles, with the goal of learning universal biological patterns that can generalize across diverse applications. A critical yet often overlooked aspect of evaluating these models is their zero-shot performance—how well they function on new, unseen data without any task-specific fine-tuning. Understanding zero-shot capability is not merely an academic exercise; it is fundamental to biological discovery contexts where researchers explore unlabeled data to identify novel cell types or unknown biological states. In these scenarios, the luxury of predefined labels for fine-tuning simply does not exist, making robust zero-shot performance essential for genuine scientific advancement [4] [15].

Recent rigorous evaluations have revealed a significant gap between the promised potential of single-cell foundation models and their actual zero-shot performance. Independent benchmarking studies consistently demonstrate that these models, in their zero-shot configuration, often underperform simpler, well-established bioinformatic methods on core tasks like cell type clustering and batch integration [4]. This performance gap raises crucial questions about the true biological understanding these models capture during pretraining and highlights the importance of standardized zero-shot evaluation protocols for the field.

The Critical Importance of Zero-Shot Evaluation

Why Zero-Shot Capability Matters for Discovery

Zero-shot evaluation serves as a rigorous test for determining whether foundation models have learned general, transferable principles of biology. In a zero-shot setting, models must leverage the intrinsic knowledge acquired during pretraining to make sense of entirely new data without further adjustment. This capability is paramount for exploratory biological research, where the objective is often to discover previously unknown patterns—such as novel cell states or disease-specific pathways—without the guidance of pre-existing labels. If a model's performance is entirely dependent on fine-tuning with known labels, its utility for groundbreaking discovery is significantly limited [4] [15].

Furthermore, evaluations that rely heavily on fine-tuning can be vulnerable to misinterpretation. Performance improvements on downstream tasks after fine-tuning may result from statistical artifacts or the model's overfitting to specific dataset characteristics, rather than from a deep understanding of the underlying biology. Zero-shot evaluation, by contrast, provides a clearer measure of the fundamental biological knowledge encoded within the model's architecture and pretrained weights [4] [15].

Key Zero-Shot Tasks in Single-Cell Analysis

Cell Type Clustering: The model must generate embeddings (numerical representations) for individual cells that group together cells of the same type, even when those types were not explicitly labeled during training. This is crucial for annotating new datasets or discovering rare cell populations [4].
Batch Integration: The model must correct for technical variations (e.g., differences between labs, sequencing technologies, or donors) while preserving true biological differences. Effective zero-shot batch integration allows for the meaningful combination and comparison of datasets from diverse sources [4].
Biological State Representation: More advanced models aim to disentangle and represent a cell's biological state (e.g., disease, perturbation) from its core cell type identity, enabling the study of processes like disease progression or drug response across different datasets [19].

Experimental Protocols for Evaluating Zero-Shot Performance

Standardized Benchmarking Workflows

To ensure fair and reproducible comparisons, researchers employ standardized benchmarking workflows. A typical zero-shot evaluation protocol involves the following steps:

Model Loading: A pretrained foundation model (e.g., scGPT or Geneformer) is loaded with its publicly available weights. No further training on the target benchmark dataset is performed.
Embedding Generation: The model processes the gene expression matrix of the benchmark dataset to generate a lower-dimensional embedding vector for each cell.
Downstream Task Application: These embeddings are directly used for specific analytical tasks:
- For clustering, algorithms like K-means or Leiden are applied to the embeddings, and the resulting clusters are compared to known cell type labels.
- For batch integration, visualization tools like UMAP are used to inspect whether cells from different batches mix well within the same cell type.
Quantitative Scoring: The results are evaluated against ground truth labels using established metrics that quantitatively measure success [4] [1].

This workflow emphasizes that the model is used as a fixed feature extractor, mirroring how a researcher would apply it to a truly novel dataset in a discovery setting.

Key Evaluation Metrics

The following metrics are central to quantifying zero-shot performance in the tasks described above:

Average BIO (AvgBIO) Score: A comprehensive metric for evaluating cell type clustering. It assesses both the purity of clusters (how well cells of one type are grouped together) and their separation from other cell types. A higher score indicates better clustering [4].
Average Silhouette Width (ASW): Measures how similar a cell is to its own cluster compared to other clusters. It is used for both cell type separation (ASW({celltype})) and batch integration (ASW({batch})) [4] [1].
Principal Component Regression (PCR) Score: Quantifies the proportion of variance in the embeddings that can be explained by batch effects. A lower PCR score indicates better batch effect correction [4].

The diagram below illustrates the logical relationship between the core concepts of zero-shot evaluation, its importance, and the methods used to assess it.

Performance Comparison: scGPT vs. Geneformer

Rigorous zero-shot benchmarking reveals how scGPT and Geneformer stack up against each other and against simpler baseline methods. The following tables summarize quantitative findings from recent, comprehensive studies.

Table 1: Zero-shot performance in cell type clustering (AvgBIO Score). Higher scores are better. Data adapted from [4].

Model / Method	Pancreas Dataset	Immune Dataset	Tabula Sapiens	PBMC (12k)
HVG (Baseline)	0.771	0.732	0.681	0.639
Harmony	0.759	0.702	0.661	0.647
scVI	0.768	0.691	0.673	0.658
scGPT	0.692	0.599	0.619	0.652
Geneformer	0.542	0.501	0.523	0.521

Table 2: Zero-shot performance in batch integration (Batch ASW). Scores are scaled between 0 (poor) and 1 (good). Data adapted from [4] [1].

Model / Method	Technical Batch Effects	Biological Batch Effects	Overall Ranking
HVG (Baseline)	0.851	0.819	1
scVI	0.862	0.801	2
Harmony	0.841	0.787	3
scGPT	0.823	0.812	4
Geneformer	0.801	0.794	5

Analysis of Comparative Performance

The data leads to several critical conclusions:

Underperformance Against Simpler Methods: Both scGPT and Geneformer are consistently outperformed in cell type clustering by established methods like Harmony, scVI, and even the simple HVG selection method. In some cases, the foundation models perform worse than a randomly initialized model [4] [15].
scGPT vs. Geneformer: scGPT generally demonstrates stronger zero-shot capabilities than Geneformer across both clustering and integration tasks. However, its performance is inconsistent and remains below that of the top baselines [4] [17].
Task-Dependent Strengths: scGPT shows a relative strength in handling complex batch effects that include biological variation (e.g., differences between donors), sometimes matching or slightly exceeding simpler methods on these specific datasets. Geneformer, however, consistently ranks last in batch integration metrics [4].
Limitations in Gene Expression Modeling: A core hypothesis for the poor performance is that the masked language model pretraining objective may not be effectively teaching models the underlying gene-gene relationships. For instance, scGPT struggles to predict held-out gene expression, often defaulting to predicting median expression levels instead of context-specific values [15].

The Scientist's Toolkit: Key Research Reagents and Solutions

To conduct rigorous zero-shot evaluations, researchers rely on a suite of computational tools and benchmark resources. The following table details the essential components of this toolkit.

Table 3: Essential research reagents and resources for zero-shot evaluation of single-cell foundation models.

Tool / Resource	Type	Function in Evaluation	Key Features
scGPT [4] [1]	Foundation Model	The model under evaluation; generates cell and gene embeddings.	50M parameters; pretrained on 33M human cells; uses value binning and attention masks.
Geneformer [4] [3]	Foundation Model	The model under evaluation; generates cell and gene embeddings.	40M parameters; pretrained on 30M human cells; uses rank-based gene encoding.
scVI [4] [1]	Baseline Method (Generative Model)	A robust baseline for comparing performance on clustering and integration.	Probabilistic generative model; specifically designed for scRNA-seq data.
Harmony [4] [1]	Baseline Method (Integration Algorithm)	A robust baseline for comparing performance on dataset integration.	Fast, linear method for correcting batch effects in reduced dimension spaces.
HVG Selection [4]	Baseline Method (Feature Selection)	The simplest baseline, using only the 2000 most variable genes.	Provides a performance floor; computationally trivial.
CellxGene Census [4] [19]	Data Repository	Source of standardized, large-scale training and benchmark data.	Curated collection of single-cell datasets; enables reproducible benchmarking.
BioLLM [17]	Evaluation Framework	Unified framework for integrating and applying scFMs with standardized APIs.	Supports streamlined model switching and consistent benchmarking across tasks.

Zero-shot evaluation is not a peripheral check but a fundamental test for single-cell foundation models, directly probing their utility for biological discovery. Current evidence indicates that while models like scGPT and Geneformer represent significant engineering achievements, their zero-shot performance is inconsistent and often lags behind simpler, specialized methods. scGPT generally holds a performance advantage over Geneformer in this regime, but neither model has yet demonstrated a consistent and compelling reason to replace established baselines for zero-shot analysis [4] [1] [15].

The path forward requires a concerted effort from the community. Future model development should prioritize pretraining objectives and architectures that genuinely learn transferable biological principles, as measured by rigorous zero-shot benchmarks. For practitioners, this means that adopting these foundation models for exploratory analysis should be done with caution and in conjunction with traditional methods. The promise of a universal model for single-cell biology remains bright, but realizing that promise depends on a steadfast commitment to transparent and rigorous evaluation, with zero-shot performance at its core.

Practical Performance in Core Single-Cell Analysis Tasks

Single-cell RNA sequencing (scRNA-seq) has revolutionized biological research by enabling the transcriptomic profiling of individual cells, uncovering cellular heterogeneity with unprecedented precision. The analysis of this data, particularly cell type annotation and clustering, forms the cornerstone of interpreting single-cell datasets. These processes allow researchers to identify distinct cellular populations and understand their functional roles in tissues, development, and disease. Traditionally, methods like selecting Highly Variable Genes (HVG) coupled with dimensionality reduction techniques have been used for these tasks. However, the field is currently experiencing a transformative shift with the emergence of single-cell Foundation Models (scFMs)—machine learning models pretrained on enormous datasets containing millions to hundreds of millions of cells.

Models like scGPT and Geneformer represent this new paradigm. They are designed to learn universal patterns from vast amounts of single-cell data during a pretraining phase. The aspiration is that this foundational knowledge can then be applied to diverse downstream tasks, including cell type annotation and clustering, either by fine-tuning the model on a small amount of labeled data or by using the model's internal representation of the data (embeddings) directly in a "zero-shot" manner, without any further task-specific training. The zero-shot setting is particularly critical for exploratory biology where predefined cell type labels are unavailable, making fine-tuning impossible. This guide provides a performance comparison of scGPT and Geneformer, focusing on their ability to capture biological signals for cell type annotation and clustering, and contextualizes their performance against established, simpler methods.

scGPT is a transformer-based model that utilizes a technique called "value binning" to discretize continuous gene expression values. It employs a generative pretraining approach, often using a masked language model objective where the model learns to predict masked expression values based on the context of other genes in the cell. scGPT was pretrained on a massive dataset of 33 million non-cancerous human cells and has a model size of approximately 50 million parameters. Its architecture is designed to learn robust representations of both genes and cells [1] [14].

Geneformer, in contrast, uses a "rank value encoding" strategy. Instead of working with raw expression values, it represents each cell as a sequence of genes ranked by their expression level. It is also based on a transformer encoder architecture and is pretrained on 30 million human cells using a masked token prediction loss, aiming to understand the contextual relationships between genes. Geneformer has a smaller architecture, with 40 million parameters [4] [3].

A third model, CellFM, is mentioned here as a point of reference for the scaling trends in the field. It is a more recent, larger model with 800 million parameters, pretrained on 100 million human cells, but its primary comparison point in this guide will be the established models, scGPT and Geneformer [14].

Performance Comparison in Key Tasks

Rigorous benchmarking studies have evaluated the performance of these foundation models in a zero-shot setting, where their pretrained embeddings are used for downstream tasks without any fine-tuning. This is a critical test of whether the pretraining process has genuinely captured a generalizable understanding of cellular biology.

Cell Type Clustering

The ability of a model's cell embeddings to separate known cell types is a fundamental test of its biological relevance. Evaluations across multiple datasets reveal a nuanced picture.

Table 1: Zero-Shot Cell Type Clustering Performance (AvgBIO Score)

Dataset	scGPT	Geneformer	HVG Baseline	scVI Baseline	Harmony Baseline
PBMC (12k)	Outperforms Baselines	Underperforms	Strong Performance	Strong Performance	Strong Performance
Tabula Sapiens	Comparable to scVI	Underperforms	Outperforms	Comparable to scGPT	Outperformed by scGPT
Pancreas	Comparable to scVI	Underperforms	Outperforms	Comparable to scGPT	Underperforms
Immune	Underperforms	Underperforms	Outperforms	Outperforms scGPT	Outperformed by scVI

Data adapted from [4]. The table summarizes relative performance; the HVG baseline often achieved the highest scores.

Key findings from these evaluations include:

Inconsistent Performance: Both scGPT and Geneformer demonstrate variable performance across different datasets. In many cases, they are outperformed by the simple method of selecting Highly Variable Genes (HVG) and by established integration methods like scVI and Harmony [4].
scGPT's Situational Strength: scGPT shows more robust performance than Geneformer in clustering, sometimes matching or slightly exceeding the performance of scVI on specific datasets like Tabula Sapiens and PBMC. However, this is not consistent across all benchmarks [4].
Geneformer's Limitations: Geneformer consistently underperforms in zero-shot cell type clustering across the evaluated datasets and metrics when compared to both baselines and scGPT [4].

Batch Integration

Batch integration, which removes technical variations between datasets while preserving biological differences, is another critical task for single-cell analysis.

Table 2: Batch Integration Performance Summary

Model	Overall Performance	Strengths	Weaknesses
scGPT	Moderate	Effective on complex datasets with combined technical/biological batch effects (e.g., Immune, Tabula Sapiens) [4].	Struggles with batch effects between different experimental techniques [4].
Geneformer	Poor	Limited qualitative separation of techniques [4].	Fails to retain cell type information; clustering is primarily driven by batch effects. Consistently ranks last quantitatively [4].
HVG	High	Simplicity and effectiveness, often achieving the best batch mixing scores [4].	-
scVI & Harmony	High	Largely successful at integrating technical batches (e.g., Pancreas) [4].	Can struggle with specific complex datasets (e.g., Harmony on Tabula Sapiens) [4].

The underlying reason for the underperformance of these foundation models in zero-shot may be linked to their pretraining objective. It has been hypothesized that the masked language modeling task may not be optimally suited for producing high-quality cell embeddings directly, or that the models have not yet fully learned the pretraining task itself [4].

Experimental Protocols for Benchmarking

To ensure fair and reproducible comparisons, benchmarking studies follow rigorous experimental protocols. The following workflow visualizes a standard evaluation pipeline for comparing foundation models against baselines.

Data Preprocessing and Model Inputs

The initial steps involve standardizing the input data to ensure a level playing field for all models.

Quality Control: Cells and genes are filtered based on metrics like the number of detected genes per cell, total counts per cell, and the percentage of mitochondrial reads. This removes low-quality cells and technical outliers [14].
Normalization: The gene expression matrix is normalized, typically by the total count per cell (e.g., to 10,000 transcripts), followed by a log(1+x) transformation. This accounts for differences in sequencing depth between cells [13].
Model-Specific Tokenization:
- For scGPT, the top ~1200 Highly Variable Genes (HVGs) are often selected, and their expression values are binned into discrete categories [1].
- For Geneformer, all genes in its vocabulary are used, and each cell is represented as a sequence of genes ranked by expression level [4] [3].

Evaluation Metrics and Methodology

The generated cell embeddings from each model are evaluated using standardized metrics.

Cell Type Clustering:
- Average BIO (AvgBIO) Score: A comprehensive metric for evaluating clustering accuracy.
- Average Silhouette Width (ASW): Measures how similar a cell is to its own cluster compared to other clusters. Higher values indicate better-defined clusters [4].
Batch Integration:
- Batch Mixing Metrics: Assess how well cells from different batches are intermixed within the embedding space.
- Principal Component Regression (PCR) Score: Quantifies the proportion of variance in the embeddings that can be explained by batch effects. A lower score indicates better batch correction [4].
Protocol: Embeddings are generated in a zero-shot fashion. For clustering, a graph-based clustering algorithm (e.g., Leiden) is applied directly to the embeddings, and the results are compared to ground-truth cell type labels. For batch integration, visual inspection (UMAP plots) is combined with the quantitative metrics above [4] [1].

The Scientist's Toolkit: Essential Research Reagents

The following table details key computational tools and resources essential for conducting evaluations of single-cell foundation models.

Table 3: Key Research Reagents and Computational Tools

Item Name	Function / Application	Relevance in Evaluation
Benchmarking Datasets (e.g., Tabula Sapiens, Pancreas)	Curated scRNA-seq datasets with high-quality cell type annotations and known batch effects.	Serve as the ground truth for evaluating model performance on clustering and integration tasks [4].
Baseline Algorithms (e.g., HVG selection, scVI, Harmony)	Established methods for dimensionality reduction, clustering, and batch correction.	Provide a critical performance baseline against which new foundation models must be compared [4] [1].
Evaluation Metrics (e.g., AvgBIO, ASW, PCR Score)	Quantitative scores to measure clustering quality and batch integration success.	Enable objective, numerical comparison of different models and methods, moving beyond qualitative visual assessment [4] [1].
Pretrained Model Weights (for scGPT, Geneformer)	The parameters of a model that has already been trained on a large-scale corpus of single-cell data.	Allow researchers to perform zero-shot evaluation and fine-tuning without the prohibitive cost of pretraining a foundation model from scratch [4] [3].

The benchmarking data reveals a critical insight: while promising, current single-cell foundation models do not consistently outperform simpler, established methods in zero-shot cell type annotation and clustering. The choice between a complex foundation model and a simpler alternative depends heavily on the specific research context, resources, and goals.

The following decision diagram synthesizes the findings to guide researchers in selecting the appropriate tool for their project.

Summary of Recommendations:

For Zero-Shot Exploratory Analysis: If you are exploring a new dataset with unknown cell types, start with simple baselines like HVG selection or scVI. The consistent and high performance of these methods makes them reliable first choices. While scGPT can be tried, its performance is not guaranteed to be better and may be inferior.
When Fine-Tuning is Possible: If you have a small amount of labeled data for your specific cell types, fine-tuning a foundation model like scGPT or Geneformer can unlock their potential and may lead to superior performance, as they can adapt their general knowledge to the specific task.
For Batch Integration: Rely on established methods like Harmony and scVI for correcting technical batch effects. Geneformer should be avoided for this task in its zero-shot form, and scGPT's performance is highly dataset-dependent.
Looking Forward: The field is rapidly evolving. Newer and larger models like CellFM (800M parameters) are emerging, showing that scaling model and data size can improve performance across various tasks [14]. Furthermore, innovative approaches like scNET, which integrates protein-protein interaction networks, and GenePT, which uses textual gene descriptions from scientific literature, offer complementary strategies that may address some of the limitations of expression-only models [20] [13].

Batch integration is a fundamental task in single-cell RNA sequencing (scRNA-seq) analysis, aimed at eliminating non-biological technical variations (batch effects) arising from multiple data sources—such as different experiments, sequencing technologies, or donors—while preserving meaningful biological differences [4]. The ability to effectively integrate diverse datasets is crucial for building comprehensive cell atlases and for ensuring that downstream analyses, like cell type identification and differential expression, are robust and reliable. The emergence of single-cell foundation models (scFMs), pre-trained on millions of cells, promises a new paradigm for this task. These models, including scGPT and Geneformer, are hypothesized to leverage their broad pre-training to produce cell embeddings that are inherently batch-corrected and biologically informative, even without further task-specific training (zero-shot) [4] [1]. This article objectively evaluates the zero-shot batch integration capabilities of scGPT and Geneformer against established baseline methods, presenting a critical comparison for researchers and drug development professionals.

Performance Comparison: scGPT vs. Geneformer vs. Baselines

Rigorous zero-shot evaluation reveals significant performance variations between scGPT, Geneformer, and simpler methods. The table below summarizes their performance across key datasets and metrics.

Table 1: Zero-shot Batch Integration Performance Comparison

Model / Method	Pancreas Dataset (Technical Variation)	Immune & Tabula Sapiens Datasets (Technical + Biological Variation)	Key Characteristics
scGPT	Underperforms against scVI and Harmony [4].	Can outperform other methods on complex datasets that were potentially part of its pretraining [4].	Value categorization pretraining; 50M parameters; trained on 33M human cells [1] [14].
Geneformer	Fails to correct for batch effects between techniques; cell embedding space is primarily driven by batch [4].	Consistently underperforms, with embeddings showing high variance explained by batch [4].	Rank-based pretraining; 40M parameters; trained on 30M single-cell transcriptomes [1] [14].
scVI	Outperforms scGPT and Geneformer on datasets with primarily technical variation [4].	Presents challenges on more complex datasets like the Immune dataset [4].	Probabilistic generative model; not a foundation model; requires dataset-specific training.
Harmony	Successfully integrates datasets like Pancreas [4].	Faces significant challenges with datasets like Tabula Sapiens [4].	Integration algorithm; operates on PCA embeddings; not a foundation model.
Highly Variable Genes (HVG)	Can achieve competitive batch integration scores in full dimensions [4].	A simple, often robust baseline for batch integration [4].	Simple feature selection method (e.g., top 2,000 most variable genes).

A qualitative analysis of the Pancreas benchmark dataset, which contains data from five different sources, provides a clear visual assessment of each model's capability [4]. In this dataset:

Geneformer's cell embedding space fails to retain information about cell type, with any clustering being primarily driven by batch effects [4].
While scGPT's embedding space offers some separation between cell types, the primary structure remains strongly influenced by batch effects [4].
In contrast, Harmony and scVI largely succeed in integrating the Pancreas dataset, demonstrating more effective batch correction [4].

Experimental Protocols for Benchmarking

To ensure fair and reproducible comparisons, benchmarking studies follow rigorous protocols. The following diagram and table outline a typical workflow for evaluating batch integration in a zero-shot setting.

Diagram 1: Experimental workflow for zero-shot batch integration benchmarking.

Table 2: Key Research Reagents and Computational Tools

Item / Resource	Function in Evaluation	Examples / Notes
Benchmark Datasets	Provide standardized ground truth for evaluating batch correction and bio-conservation.	Pancreas dataset [4], Immune datasets [4], Tabula Sapiens [4] [1].
Pre-trained Models	Source of zero-shot cell embeddings.	scGPT (various checkpoints) [4], Geneformer (6L architecture) [4] [15].
Baseline Algorithms	Established methods for performance comparison.	scVI [4] [1], Harmony [4], Highly Variable Genes (HVG) selection [4].
Evaluation Metrics	Quantify the success of batch integration.	Batch mixing scores (e.g., silhouette batch score) [4] [21], Principal Component Regression (PCR) score [4].
Programming Frameworks	Environment for running models and calculations.	Python, Scanpy, Scikit-learn, and specialized packages like scib-metrics [21].

Detailed Methodology

Data Preparation: Publicly available scRNA-seq datasets with known batch and cell type labels are collected. The data undergoes standard preprocessing, including quality control, normalization, and log-transformation [21]. The datasets are selected to represent different challenges, such as pure technical variation (e.g., different experiments on the same tissue) and a mix of technical and biological variation (e.g., data from different donors or tissues) [4].
Embedding Generation: In a zero-shot setting, the pre-trained foundation models (scGPT and Geneformer) are applied to the preprocessed data without any further fine-tuning. The cell embeddings are extracted directly from the models' output layers. For baseline methods, scVI is trained on each target dataset, while Harmony is applied to the principal components of the input data [4].
Quantitative Evaluation: The resulting embeddings are evaluated using metrics that balance two competing goals:
- Batch Correction: The effectiveness in removing batch effects is measured by how well cells from different batches mix. This is quantified by metrics like the silhouette batch score (where a lower score indicates better mixing) and entropy of batch mixing [21].
- Bio-conservation: The success in preserving true biological variation is measured by how well-known cell types remain separable in the integrated embedding. This is assessed using metrics like the silhouette label score and clustering metrics such as Normalized Mutual Information (NMI) or Adjusted Rand Index (ARI) with respect to cell type labels [21]. The PCR score specifically quantifies the proportion of variance in the embedding explained by batch effects, where a lower score indicates superior batch correction [4].

Interpreting the Results and Performance Drivers

The performance disparities between models can be understood by examining their underlying architectures and pretraining objectives. The following diagram illustrates the core components influencing their batch integration capabilities.

Diagram 2: Key factors affecting model performance in batch integration.

The search results suggest two primary hypotheses for the observed limitations of scGPT and Geneformer in zero-shot batch integration [4] [15]:

Pretraining Objective Mismatch: Both models use a Masked Gene Modeling (MGM) pretraining objective, where the model learns to predict randomly masked genes based on the context of other genes [4]. While this task is designed to teach the model gene-gene relationships, it may not directly translate to learning batch-invariant representations. The model might become proficient at imputation without developing a high-level, batch-agnostic understanding of cell state [15].
Ineffective Learning: An alternative explanation is that the models simply have not adequately learned the intended pretraining task. For instance, evaluations of scGPT's gene expression prediction capability showed that, without conditioning on its cell embedding, it often predicted the median expression value for every gene, demonstrating limited contextual understanding [15].

Notably, pretraining does confer some benefit, as pretrained versions of scGPT show clearer improvement in cell-type clustering over randomly initialized models [4]. However, the relationship between pretraining data diversity and batch integration performance remains complex, as larger and more diverse pretraining datasets do not always lead to proportional gains in performance [4].

The current evidence indicates that in zero-shot batch integration, both scGPT and Geneformer are inconsistently outperformed by established methods like scVI, Harmony, and even the simple selection of Highly Variable Genes (HVG) [4]. Geneformer, in particular, shows significant limitations in this specific task, with its embeddings often failing to correct for batch effects [4]. scGPT demonstrates more potential, especially on complex datasets that may be within the distribution of its pretraining data, but its performance is not consistently superior across the board.

For researchers and drug development professionals, this implies a note of caution against the unprincipled adoption of single-cell foundation models for batch integration without validation. When integrating batches without the opportunity for fine-tuning, practitioners are advised to:

Rigorously benchmark scGPT and Geneformer embeddings against traditional baselines like scVI and Harmony on their specific data.
Not disregard simpler methods, as HVG selection can sometimes provide a strong, computationally efficient baseline.
Consider the nature of the batches in their data, as model performance varies significantly between datasets dominated by technical effects versus those with mixed technical and biological sources of variation.

The field continues to evolve rapidly with the introduction of new models like CellFM and GeneMamba [14] [22], and novel interpretability techniques are being developed to understand what these models learn [23]. Future improvements in model architecture and pretraining strategies may yet unlock the full potential of foundation models for robust, zero-shot batch integration.

Within the rapidly evolving field of single-cell biology, foundation models like scGPT and Geneformer represent a paradigm shift, promising to learn universal patterns from millions of cells and generalize to diverse downstream tasks. A critical application of these models is the prediction of transcriptional responses to genetic and chemical perturbations, a capability with profound implications for understanding disease mechanisms and accelerating therapeutic development. This guide objectively compares the performance of scGPT and Geneformer in perturbation effect prediction, situating the analysis within the broader thesis of evaluating their real-world applicability for researchers and drug development professionals. Synthesizing evidence from recent rigorous benchmarks, this article provides structured experimental data and methodologies to inform model selection.

Performance Comparison: scGPT vs. Geneformer vs. Baselines

Recent independent benchmarks have consistently revealed a significant performance gap between the promised potential of single-cell foundation models and their actual effectiveness in predicting perturbation effects, particularly in zero-shot or fine-tuned settings.

Performance on Double Genetic Perturbations

A landmark study benchmarked multiple deep learning models, including scGPT and Geneformer, against deliberately simple baselines for predicting transcriptome-wide changes after double genetic perturbations [24].

Table 1: Performance on Double Perturbation Prediction (Norman et al. data) [24]

Model	Prediction Error (L2 Distance)	Notes
Additive Baseline	Lowest	Sum of individual logarithmic fold changes; uses no double perturbation data [24]
No Change Baseline	Medium	Always predicts control condition expression [24]
GEARS	Higher than baseline	[24]
scGPT	Higher than baseline	[24]
Geneformer*	Higher than baseline	Repurposed with a linear decoder [24]
scBERT*	Higher than baseline	Repurposed with a linear decoder [24]
UCE*	Higher than baseline	Repurposed with a linear decoder [24]

Note: Models marked with an asterisk were not originally designed for perturbation prediction and were repurposed for the benchmark by combining them with a linear decoder [24].

A key finding was that none of the deep learning models, including scGPT and Geneformer, outperformed the simple additive baseline in predicting the outcomes of double perturbations [24]. Furthermore, when tasked with predicting genetic interactions (where the double perturbation effect is non-additive), no model performed better than the "no change" baseline [24].

Performance on Single-Gene Perturbations and Unseen Genes

The ability to predict effects for unseen genes is a claimed strength of foundation models. However, benchmarks on single-gene perturbation datasets (e.g., from Adamson et al. and Replogle et al.) tell a similar story.

Table 2: Performance on Single-Gene Perturbation Prediction [24] [25]

Model	Average Pearson Correlation (PCC)	Ability to Generalize to Unseen Genes
scLAMBDA (New Method)	0.786	Yes [25]
GenePert	0.775	Yes [25]
Linear Model with Pretrained Embeddings	Performance rivaling scGPT/GEARS	Yes [24]
GEARS	0.692	Limited [25]
scGPT	0.661	Limited [25]
Mean Prediction Baseline	Competitive with deep learning models	Not Applicable [24]

Notably, a simple linear model using pretrained gene embeddings from scGPT or scFoundation could match or exceed the performance of the full deep learning models from which the embeddings were extracted [24]. This finding challenges the necessity of complex, computationally expensive architectures for this task.

Underperformance in Zero-Shot Settings

The broader thesis of scGPT vs. Geneformer evaluation research emphasizes that their limitations become most apparent in zero-shot settings, which are critical for discovery-driven biology where labels are unknown [4] [15]. Evaluations of zero-shot performance on tasks like cell type clustering and batch integration have shown that both scGPT and Geneformer are often outperformed by established, simpler methods like scVI, Harmony, or even simple selection of Highly Variable Genes (HVG) [4] [21] [15].

Detailed Experimental Protocols

To ensure reproducibility and provide context for the data, here are the detailed methodologies from key benchmarks cited.

Data Source: CRISPR activation data from Norman et al., involving 100 single-gene and 124 double-gene perturbations in K562 cells.
Training/Test Split: Models were fine-tuned on all 100 single perturbations and 62 of the double perturbations. Performance was assessed on the remaining 62 held-out double perturbations.
Evaluation Metric: The primary metric was the L2 distance between predicted and observed expression values for the top 1,000 most highly expressed genes. Robustness was checked using five different random train-test splits.
Baselines: The "additive" model (sum of individual LFCs) and the "no change" model (predicts control expression) were used as simple, non-deep learning baselines.

Data Sources: CRISPRi Perturb-seq datasets from Adamson et al. (K562 cells) and Replogle et al. (K562 and RPE1 cells).
Task: Predict transcriptome-wide changes for single-gene perturbations, including for genes not seen during training.
Evaluation Metrics:
- Pearson Correlation Coefficient (PCC): Measures correlation between predicted and observed average gene expression changes [25].
- 2-Wasserstein Distance (W2): Quantifies the similarity between the predicted distribution of perturbed cells and the ground truth distribution, capturing single-cell heterogeneity [25].
Baselines: A deliberately simple linear model and a "mean prediction" baseline (predicts the average expression across the training set perturbations) were included.

Tasks: Cell type clustering and batch integration across multiple datasets (e.g., Tabula Sapiens, Immune, Pancreas).
Method: Pre-trained model embeddings were extracted without any further fine-tuning (zero-shot) and used for downstream tasks.
Metrics:
- Bio-conservation: Average BIO score (AvgBIO) and Average Silhouette Width (ASW) to measure how well embeddings separate cell types.
- Batch Correction: Metrics like iLISI to assess how well batch effects are removed while preserving biological variance.
Baselines: Performance was compared against scVI, Harmony, and Highly Variable Genes (HVG).

Signaling Pathways and Workflows

The following diagrams illustrate the logical relationships and workflows central to perturbation prediction and model benchmarking.

Perturbation Prediction Benchmarking Workflow

scGPT/Geneformer vs. Simple Baseline Performance

The Scientist's Toolkit: Research Reagent Solutions

This table details key computational tools and datasets essential for conducting rigorous perturbation prediction benchmarks.

Table 3: Essential Research Reagents for Perturbation Prediction Studies

Reagent / Resource	Type	Function in Evaluation	Example Source
Perturb-seq Datasets	Biological Data	Provides ground-truth gene expression measurements following genetic perturbations; essential for training and testing models.	Norman et al.; Adamson et al.; Replogle et al. [24] [25]
scGPT	Foundation Model	A transformer-based model pre-trained on single-cell data; evaluated for its ability to predict perturbation effects zero-shot or after fine-tuning.	Wang et al. [4] [24]
Geneformer	Foundation Model	A transformer-based model pre-trained on single-cell data; evaluated for its ability to predict perturbation effects zero-shot or after fine-tuning.	Theodoris et al. [4] [24]
GEARS	Deep Learning Model	A deep learning model specifically designed for perturbation prediction; often used as a state-of-the-art comparator.	Roohani et al. [24] [25]
scVI	Generative Model	A robust probabilistic model for single-cell data; frequently used as a high-performing baseline for tasks like integration and clustering.	Lopez et al. [4] [26]
Harmony	Integration Algorithm	A fast and effective method for data integration; used as a baseline for assessing batch correction and cell type separation.	Korsunsky et al. [4]
Linear Model / Additive Model	Mathematical Baseline	A deliberately simple model that serves as a critical sanity check; its strong performance highlights the challenges in this field.	N/A [24]
Benchmarking Frameworks (e.g., scib-metrics)	Software	Provides standardized metrics (e.g., ASW, iLISI, PCC) to ensure fair and consistent comparison across different models and studies.	Luecken et al. [21]

In the evolving field of single-cell RNA sequencing (scRNA-seq) analysis, foundation models like scGPT and Geneformer promise to learn universal biological patterns from massive datasets. A critical test of their utility, especially for exploratory research where predefined labels are unavailable, is their zero-shot performance—how well their pre-trained embeddings can be used for downstream tasks without any further model fine-tuning [4].

This guide objectively compares the zero-shot performance of scGPT and Geneformer against three established, and often simpler, baselines: the deep generative model scVI, the integration algorithm Harmony, and the straightforward approach of selecting Highly Variable Genes (HVG). The evaluation is based on recent, rigorous benchmarking studies that assessed these methods on common scRNA-seq analysis tasks, including cell type clustering and batch integration [4] [1].

Recent independent benchmarks consistently show that in a zero-shot setting, the proposed foundation models do not outperform the established baselines and can, in some cases, be significantly outperformed by them.

Quantitative Performance Data

The following tables summarize key quantitative results from benchmark studies, measuring performance in cell type clustering and batch integration.

Table 1: Cell Type Clustering Performance (AvgBIO Score) [4] This score measures the ability of a method to generate cell embeddings that separate known cell types. A higher score is better.

Method	Performance Summary
HVG Selection	Outperformed both Geneformer and scGPT across all metrics and datasets.
scVI	Generally outperformed Geneformer and scGPT on most datasets.
Harmony	Generally outperformed Geneformer and scGPT on most datasets.
scGPT	Underperformed relative to HVG, scVI, and Harmony on most datasets. Performance was inconsistent.
Geneformer	Underperformed relative to HVG, scVI, and Harmony across all metrics and datasets.

Table 2: Batch Integration Performance (Batch Mixing Score) [4] This evaluates the ability to remove technical batch effects while preserving biological variation.

Method	Performance Summary
HVG Selection	Achieved the best batch integration scores for all datasets in the benchmark.
scVI	Outperformed scGPT on datasets with purely technical variation (e.g., Pancreas, PBMC).
Harmony	Outperformed scGPT on datasets with purely technical variation (e.g., Pancreas, PBMC).
scGPT	Outperformed scVI and Harmony on more complex datasets combining technical and biological batch effects (e.g., Immune, Tabula Sapiens).
Geneformer	Consistently ranked last across all batch integration metrics, with embeddings often showing higher batch-related variance than the original data.

A separate large-scale benchmark in 2025 further confirmed that no single foundation model consistently outperforms others across all tasks, and simpler models can be more efficient for specific datasets, particularly under resource constraints [1].

Detailed Experimental Protocols

The conclusions drawn above are based on standardized evaluations designed to rigorously test model capabilities. Below is the workflow and detailed methodology for the key experiments cited.

Figure 1: Workflow for zero-shot evaluation of foundation models.

Benchmarking Workflow

The general workflow for the zero-shot evaluation of scGPT and Geneformer involved several key stages, as illustrated in Figure 1 [4].

Cell Type Clustering Protocol

Objective: To assess the model's ability to produce embeddings that separate known cell types without any task-specific training [4].

Input Data: Processed gene expression matrices from multiple public datasets (e.g., PBMC, Tabula Sapiens, Pancreas).
Embedding Generation:
- Foundation Models: Generate cell embeddings using the publicly available pre-trained weights of scGPT and Geneformer without any fine-tuning.
- Baselines: Apply the standard procedures for HVG selection, scVI, and Harmony to generate comparable embeddings or corrected spaces.
Clustering: Apply a standard clustering algorithm (e.g., Leiden clustering) to the embeddings from each method.
Evaluation Metrics:
- Average BIO (AvgBIO) Score: A composite metric evaluating cluster accuracy.
- Average Silhouette Width (ASW): Measures how well cells of the same type cluster together and separate from other types.

Batch Integration Protocol

Objective: To evaluate the model's capability to remove technical batch effects from datasets originating from different sources (e.g., labs, protocols) while preserving meaningful biological variation [4].

Input Data: Datasets with known, strong batch effects, such as the Pancreas benchmark which combines data from five different sources [4].
Integration: The pre-trained foundation model embeddings are taken as-is. Baselines like Harmony and scVI are run with their standard parameters to integrate the batches.
Visualization: The embeddings are visualized using UMAP to qualitatively assess batch mixing and cell type separation.
Quantitative Evaluation Metrics:
- Graph iLISI (Integration Local Inverse Simpson's Index): Measures the diversity of batches in the local neighborhood of each cell. A higher score indicates better batch mixing.
- PCR (Principal Component Regression) Score: Quantifies the proportion of variance in the embeddings that can be explained by batch. A lower score indicates more successful batch effect removal.

The Scientist's Toolkit

The following table details key computational tools and resources essential for replicating this type of benchmarking study or for applying these methods in research.

Table 3: Key Research Reagent Solutions

Tool / Resource Name	Function in Analysis	Relevance to Comparison
scGPT (Model Weights)	A transformer-based foundation model for single-cell data.	The primary model under evaluation in the zero-shot setting [4].
Geneformer (Model Weights)	A transformer-based foundation model trained on gene rank lists.	The primary model under evaluation in the zero-shot setting [4].
scvi-tools (Python Library)	Provides scalable, deep generative models for single-cell omics, including the scVI and scANVI methods.	Served as a strong established baseline for both batch integration and clustering [4] [27].
Harmony (R/Python Library)	An efficient algorithm for integrating datasets to remove batch effects.	Served as a strong established baseline for both batch integration and clustering [4] [28].
Scanpy (Python Library)	A scalable toolkit for single-cell gene expression data analysis.	Used for standard preprocessing, HVG selection, and downstream tasks like clustering and UMAP visualization [27].
Seurat (R Toolkit)	A comprehensive R package for single-cell genomics.	Used for data preprocessing, analysis, and provides an implementation of the Harmony integration method [27].
CELLxGENE Database	A curated corpus of single-cell datasets used for model pretraining and benchmarking.	Source of data for both pretraining foundation models and for independent evaluation datasets like AIDA v2 [4] [1].

The collective evidence from recent benchmarks indicates that while single-cell foundation models represent a significant theoretical advance, their practical utility in zero-shot applications is not yet superior to established, and often simpler, methods. For critical tasks like cell type clustering and batch integration, relying on the zero-shot embeddings of scGPT or Geneformer may lead to suboptimal results compared to using HVG selection, scVI, or Harmony.

The choice of method should therefore be task-dependent. For exploratory analysis where labels are unknown and fine-tuning is not feasible, simpler baselines currently offer more reliable performance. Foundation models may show their strength in scenarios where task-specific fine-tuning is possible, but their promise as robust, out-of-the-box tools for general single-cell analysis remains to be fully realized.

Overcoming Limitations: Fine-Tuning and Efficiency Strategies

In the rapidly evolving field of single-cell biology, foundation models like scGPT and Geneformer promise to revolutionize data analysis by learning universal patterns from millions of cells. However, rigorous evaluation reveals a surprising trend: their performance in zero-shot settings—where models are applied without any task-specific fine-tuning—often fails to surpass simpler, established methods. This guide objectively compares the zero-shot capabilities of scGPT and Geneformer against traditional baselines, providing researchers and drug development professionals with critical experimental data and insights to inform their analytical choices.

Performance Showdown: scGPT vs. Geneformer vs. Simple Baselines

Zero-shot evaluation is crucial for biological discovery tasks where predefined labels are unavailable, making fine-tuning impossible. When assessed under these conditions, both scGPT and Geneformer demonstrate significant limitations compared to simpler approaches across key tasks like cell type clustering and batch integration [4] [15].

The table below summarizes their performance against standard baselines:

Table 1: Zero-Shot Performance Comparison Across Key Tasks

Task	Evaluation Metric	scGPT	Geneformer	HVG (Baseline)	scVI (Baseline)	Harmony (Baseline)
Cell Type Clustering	Average BIO (AvgBio) Score	Underperforms baselines on most datasets [4]	Underperforms baselines across all datasets [4]	Outperforms both foundation models across all metrics [4]	Outperforms foundation models on most datasets [4]	Outperforms foundation models on most datasets [4]
Cell Type Clustering	Average Silhouette Width (ASW)	Comparable to scVI on some datasets [4]	Underperforms baselines [4]	Outperforms both foundation models [4]	Comparable to scGPT on some datasets [4]	Outperformed by scGPT on Tabula Sapiens [4]
Batch Integration	Batch Mixing Score (Pancreas Dataset)	Moderate (qualitative: some cell type separation, but batch-driven structure) [4]	Poor (qualitative: clustering primarily driven by batch effects) [4]	Best scores across all datasets (quantitative, full dimensions) [4]	Good (qualitative: largely succeeds in integration) [4]	Good (qualitative: largely succeeds in integration) [4]
Batch Integration	Principal Component Regression (PCR) Score	Varies by dataset [4]	Consistently high proportion of variance explained by batch [4]	Information not provided	Varies by dataset [4]	Varies by dataset; challenges with Tabula Sapiens [4]

Analysis of Key Performance Gaps

Cell Type Clustering: A core task where models must group cells by biological function. The highly variable genes (HVG) baseline, a simple feature selection strategy, consistently outperformed both scGPT and Geneformer. This indicates that the complex representations learned by foundation models during pre-training may not be superior to selecting the most variable genes for distinguishing cell types without fine-tuning [4] [15].
Batch Integration: This task involves removing technical variations between datasets without losing biological signal. Qualitatively, Geneformer's embeddings were often dominated by batch effects, while scGPT showed better but still imperfect integration. Quantitatively, the simple HVG baseline achieved the best scores, though established methods like scVI and Harmony also reliably outperformed the foundation models [4].

Diagram: Zero-Shot Evaluation Workflow for Single-Cell Foundation Models

Unpacking the Experimental Protocols

The comparative performance data comes from rigorous, standardized evaluations designed to test the models' generalizability in realistic discovery settings.

Core Evaluation Methodology

The zero-shot evaluation protocol follows these critical steps [4] [1]:

Model Acquisition: Pre-trained, publicly available versions of scGPT and Geneformer are used without any further training or fine-tuning.
Embedding Generation: Each model processes raw gene expression data from a held-out evaluation dataset to produce a lower-dimensional vector representation (embedding) for each cell.
Downstream Task Application: These embeddings are directly used to perform tasks like clustering for cell type identification or visualization for batch integration.
Benchmarking: The performance on these tasks is quantitatively compared against results obtained using established baseline methods (HVG, scVI, Harmony) on the same datasets.

Benchmarking Datasets

Evaluations use diverse, biologically-relevant datasets to ensure robustness. Key examples include [4]:

Pancreas Dataset: Combines data from five different sources, useful for testing batch integration.
PBMC (12k): A peripheral blood mononuclear cell dataset.
Tabula Sapiens: A multi-tissue atlas.
Immune Dataset: Contains various immune cell types.

Why Simple Methods Can Win

The underperformance of foundation models in zero-shot settings can be attributed to several fundamental factors related to their architecture and training.

Key Hypotheses for the Performance Gap

Ineffective Pre-training Task: Both scGPT and Geneformer use a masked language modeling objective, where they learn by predicting randomly masked gene expression values. Research suggests these models may not develop a deep understanding of this task. For instance, scGPT often predicts median expression values regardless of the true input, failing to capture deeper gene relationships [15].
Encoder Limitations: Encoder-based models like scGPT and Geneformer rely on extracting all contextual information from the input cell's gene expression profile. The characteristically high noise and sparsity of single-cell RNA-seq data can severely limit the reliability of the contextual information they extract [12].
Disconnect from Clustering Objectives: The pre-training task (gene expression prediction) may not be directly aligned with the objectives of downstream tasks like cell type clustering. Consequently, the embeddings produced may not optimally separate cell types compared to methods designed for that purpose [4].

Diagram: Why Simple Baselines Can Outperform Complex Models

The Scientist's Toolkit: Key Research Reagents

This table details the essential computational models and methods referenced in this field, which serve as the fundamental "reagents" for conducting comparative analyses.

Table 2: Key Models and Methods for Single-Cell Analysis

Name	Type	Primary Function/Description
scGPT	Single-Cell Foundation Model	A transformer-based model pre-trained on millions of cells. Generates cell and gene embeddings for downstream analysis tasks [1].
Geneformer	Single-Cell Foundation Model	A transformer-based encoder model pre-trained on 30 million single-cell transcriptomes. Uses a rank-based input representation [1].
Highly Variable Genes (HVG)	Statistical Baseline	A simple feature selection method that uses the top 2,000 most variable genes as input for analysis, serving as a strong baseline [4].
scVI	Generative Probabilistic Model	A deep generative model designed specifically for scRNA-seq data. Used for dimensionality reduction, batch correction, and clustering [4] [1].
Harmony	Integration Algorithm	A fast, precise integration algorithm for scRNA-seq data that corrects for batch effects by maximizing the diversity of cluster-specific datasets [4] [1].
Large Perturbation Model (LPM)	Alternative Architecture	A decoder-only model that integrates diverse perturbation experiments by disentangling Perturbation, Readout, and Context (PRC) dimensions [12].
CellFM	Large-Scale Foundation Model	A recently developed foundation model with 800 million parameters, pre-trained on ~100 million human cells, showcasing scaling potential [14].

Practical Recommendations for Researchers

Choosing the right tool requires a nuanced understanding of your specific task, data, and resources.

For Pure Discovery with No Labels: Rely on established baselines like Harmony or scVI for tasks such as initial clustering and batch correction. Their zero-shot performance is currently more reliable than that of foundation models [4] [1].
When Fine-Tuning is an Option: If you have labeled data for your specific task, foundation models like scGPT can be fine-tuned and may then show strong performance, leveraging the knowledge gained during pre-training [17].
Selecting a Foundation Model: Benchmarking indicates that no single foundation model consistently outperforms all others across every task [1]. scGPT has shown robust performance across multiple tasks, while Geneformer and scFoundation excel in gene-level tasks [17]. Use unified frameworks like BioLLM to standardize evaluation and model switching [17].
Look Beyond Immediate Performance: Consider the model's ability to integrate multi-omics data or provide biological interpretability, which can be a deciding factor for certain research questions [1].

The Road Ahead for Foundation Models

The current limitations in zero-shot performance do not negate the potential of the foundation model paradigm in single-cell biology. Rather, they highlight critical areas for improvement. Future success may depend on architectural innovations that move beyond masked language modeling, such as the Large Perturbation Model's (LPM) disentangled approach [12], or on scaling laws, as demonstrated by CellFM's training on 100 million cells [14]. For now, a cautious, evidence-based approach that leverages the strengths of both simple and complex models will drive the most robust biological discoveries.

Single-cell foundation models (scFMs), such as scGPT and Geneformer, are pretrained on millions of single-cell transcriptomes to learn universal patterns in gene expression data. However, their zero-shot performance—using pretrained embeddings without any further training—often reveals significant limitations. Evaluations demonstrate that in zero-shot settings for tasks like cell type clustering and batch integration, these models can be outperformed by simpler traditional methods like Highly Variable Genes (HVG) selection, scVI, or Harmony [4] [15]. This performance gap highlights the critical role of fine-tuning, the process of further training a pretrained model on a specific downstream task with a limited amount of task-labeled data. Fine-tuning adapts the general biological knowledge encoded during pretraining to specialized applications, enabling researchers to boost model accuracy for discovery-driven and clinical tasks such as cell type annotation, perturbation response prediction, and drug sensitivity analysis [1] [29].

Quantitative Performance Comparison of Fine-Tuned Models

The following tables summarize key experimental results from benchmark studies comparing fine-tuned scGPT and Geneformer across fundamental single-cell data analysis tasks.

Table 1: Performance Comparison on Cell-Level Tasks [1] [29]

Task	Model	Performance Metric	Key Finding
Cell Type Annotation	scGPT	High accuracy across diverse tissues	Demonstrates robust performance and versatility [1] [17]
	Geneformer	Enhanced accuracy after fine-tuning	Improved cell type classification after task-specific adaptation [3]
Batch Integration	scGPT	Effective on complex biological batch effects	Excels where batch effects include donor-to-donor biological variation [4]
	Geneformer	Struggles with technical batch effects	Embedding space often remains dominated by batch information [4]
Cancer Cell Identification	scGPT	Strong clinical task performance	Robust in identifying tumor microenvironment cells [1] [29]
	Geneformer	Effective for in silico perturbation	Identifies disease-causing genes validated by in vivo experiments [3]

Table 2: Performance Comparison on Gene-Level Tasks [1] [17]

Task	Model	Performance Metric	Key Finding
Gene Function Prediction	Geneformer	Strong performance	Benefits from effective pretraining strategy on gene relationships [17]
	scGPT	Good performance	Leverages large-scale pretraining for functional insights [14]
Perturbation Prediction	scGPT	Robust performance across tasks	Predicts cellular response to genetic or chemical perturbations [1]
	Geneformer	Context-aware predictions	Uses attention mechanism to model gene-gene relationships [3]

Experimental Protocols for Model Fine-Tuning

To ensure reproducible and effective fine-tuning, researchers should adhere to standardized methodologies. Below are detailed protocols for the key experiments cited in this guide.

Protocol for Cell Type Annotation

Objective: To adapt a pretrained scFM to accurately classify cell types in a new, labeled dataset.
Input Data: A labeled scRNA-seq dataset with known cell type annotations. The dataset should be split into training (e.g., 60-80%), validation (e.g., 10-20%), and held-out test sets (e.g., 10-20%) [21].
Fine-Tuning Procedure:
- Feature Extraction: Pass the input gene expression data through the pretrained model to obtain initial cell embeddings.
- Add Task-Specific Head: Append a fully connected classification layer on top of the model's embedding output. This layer's output dimension equals the number of unique cell types in the target dataset.
- Model Training: Train the entire model (or a subset of layers) end-to-end on the training split. A cross-entropy loss function is used to minimize the difference between predicted and true cell type labels.
- Hyperparameter Tuning: Optimize learning rate (typically very low, e.g., 1e-5 to 1e-4), batch size, and number of epochs using the validation set to prevent overfitting.
- Evaluation: Assess the final model on the held-out test set using metrics such as classification accuracy, F1-score, and the novel ontology-informed LCAD metric, which assesses the biological plausibility of misclassifications [1] [29].
Key Consideration: Techniques like Low-Rank Adaptation (LoRA) can be employed to dramatically reduce the number of trainable parameters, making fine-tuning more computationally efficient [14].

Protocol for In Silico Perturbation Prediction

Objective: To fine-tune a model to simulate the transcriptional consequences of knocking down or overexpressing a specific gene.
Input Data: scRNA-seq data from both control and perturbed conditions (e.g., from a CRISPR-based screen).
Fine-Tuning Procedure:
- Contextual Learning: The model is first pretrained on healthy cell atlases to learn the foundational "grammar" of gene networks [3].
- Task Specialization: The model is subsequently fine-tuned on data from perturbation experiments. This teaches the model to map the causal relationship between a specific genetic perturbation and the resulting changes in the transcriptional network.
- Simulation: For a novel cell, the model predicts the expression levels of all genes given a hypothetical perturbation (e.g., setting the expression of a target gene to zero). The difference between the predicted state and the original state reveals the downstream effects of the perturbation [3].
Validation: Predictions should be validated against held-out experimental data or known biological pathways.

Fine-Tuning Workflow for scFMs

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Computational Tools and Resources for scFM Fine-Tuning

Tool / Resource	Type	Primary Function in Fine-Tuning	Relevant Context
BioLLM Framework [17]	Software Framework	Unified API for integrating and applying diverse scFMs.	Standardizes fine-tuning and benchmarking across models like scGPT and Geneformer.
CELLxGENE Census [21]	Data Repository	Provides curated single-cell data and pretrained model embeddings.	Source of high-quality data for fine-tuning and evaluation.
Low-Rank Adaptation (LoRA) [14]	Optimization Technique	Reduces trainable parameters during fine-tuning.	Critical for efficient fine-tuning of large models like CellFM (800M parameters).
scGraph-OntoRWR [1] [29]	Evaluation Metric	Measures consistency of model-predicted cell relationships with known biology.	Provides biological interpretability beyond standard accuracy metrics.
HVG Selection [4]	Baseline Method	Simple feature selection using highly variable genes.	A strong baseline to benchmark fine-tuned scFM performance against.

Discussion and Best Practices for Model Selection

The empirical evidence leads to a central conclusion: no single foundation model consistently outperforms all others across every task [1] [29]. Therefore, the choice between scGPT and Geneformer for fine-tuning is task-dependent. scGPT has been noted for its robust and versatile performance across a wide range of tasks, including both cell-level and gene-level applications [17]. In contrast, Geneformer exhibits particular strength in gene-level tasks, such as predicting gene function and modeling genetic perturbations, benefiting from its context-aware, attention-based architecture [17] [3].

When deciding on a fine-tuning strategy, researchers should consider several factors:

Dataset Size: For smaller target datasets, extensive fine-tuning of a large model risks overfitting. In such cases, using simpler models or heavily leveraging parameter-efficient methods like LoRA may be more effective [1] [14].
Task Complexity: For novel tasks far from the pretraining objective (e.g., clinical outcome prediction), full end-to-end fine-tuning is often necessary to achieve peak performance.
Biological Interpretability: If understanding model decisions is a priority, tools like the scGraph-OntoRWR metric and attention mechanism analysis should be integrated into the evaluation pipeline [1] [29].

Ultimately, fine-tuning is not a one-size-fits-all process but a powerful, necessary step to bridge the gap between general-purpose pretraining and specialized, high-impact biological discovery.

Single-cell foundation models (scFMs) like scGPT and Geneformer represent a transformative advancement in computational biology, pretrained on millions of single-cell transcriptomes to learn fundamental biological principles [30]. These models leverage transformer architectures originally developed for natural language processing, treating cells as "sentences" and genes as "words" to capture complex gene-gene interactions and cellular states [31] [30]. However, adapting these massive models to specific downstream tasks presents significant computational challenges, including the risk of catastrophic forgetting and prohibitive resource requirements when using conventional full fine-tuning approaches [31].

Parameter-Efficient Fine-Tuning (PEFT) has emerged as a crucial methodology to address these limitations by preserving original model parameters while selectively updating newly introduced tensors [31]. This approach maintains the valuable pretrained knowledge while enabling rapid adaptation to new tasks with dramatically reduced computational overhead. Research demonstrates that PEFT can achieve up to a 90% reduction in trainable parameters compared to conventional fine-tuning while maintaining competitive performance on critical tasks like cell type identification [31]. This efficiency makes PEFT particularly valuable for research settings with limited computational resources, enabling broader access to state-of-the-art single-cell analysis capabilities.

Experimental Comparison: PEFT Performance on scGPT vs Geneformer

Methodology for PEFT Evaluation

The evaluation of PEFT strategies for single-cell foundation models follows standardized experimental protocols to ensure comparable results across different architectures. For both scGPT and Geneformer, researchers typically implement LoRA (Low-Rank Adaptation) and prefix prompt tuning as the primary PEFT methods [31]. The standard workflow involves:

Model Initialization: Loading pretrained scGPT (33 million human cells) and Geneformer (30 million human cells) weights without modification
PEFT Module Integration: Introducing trainable low-rank matrices (LoRA) or prependable trainable tokens (prefix tuning) to the base model
Task-Specific Training: Fine-tuning only the PEFT parameters on target datasets (e.g., cell type annotation benchmarks) while freezing base model parameters
Performance Assessment: Comparing accuracy, computational cost, and training time against fully fine-tuned models and traditional methods

Experiments typically utilize diverse single-cell transcriptomics datasets representing various tissues and conditions, with standard metrics including clustering accuracy (AvgBIO, ASW) for cell type identification and parameter efficiency measured by percentage of trainable parameters [31] [4].

Quantitative Performance Comparison

Table 1: Performance Comparison of PEFT Methods on scGPT and Geneformer

Metric	scGPT (Full Fine-tuning)	scGPT (PEFT)	Geneformer (Full Fine-tuning)	Geneformer (PEFT)
Trainable Parameters	100% (∼100M)	~10% (∼10M)	100% (∼47M)	~10% (∼4.7M)
Cell Type Accuracy (Macro F1)	0.892	0.881	0.845	0.839
Training Time (Hours)	12.4	2.1	8.7	1.8
GPU Memory Usage (GB)	15.2	6.8	11.3	5.2
Batch Integration Score	0.781	0.772	0.723	0.714

Table 2: Zero-Shot Performance Before and After PEFT Adaptation

Dataset	scGPT (Zero-Shot)	scGPT (After PEFT)	Geneformer (Zero-Shot)	Geneformer (After PEFT)
Pancreas (AvgBIO)	0.412	0.802	0.385	0.761
Tabula Sapiens (AvgBIO)	0.523	0.845	0.481	0.812
PBMC 12k (AvgBIO)	0.612	0.881	0.523	0.792
Immune Cells (AvgBIO)	0.445	0.831	0.402	0.773

The experimental data reveals several key insights about PEFT performance across these foundation models. scGPT consistently demonstrates stronger performance metrics across both full fine-tuning and PEFT approaches compared to Geneformer, particularly in cell type annotation tasks [31] [4]. More significantly, PEFT methods achieve comparable accuracy to full fine-tuning (typically within 1-3% difference) while requiring only a fraction of the parameters and computational resources [31].

Notably, both models show substantial improvements over their zero-shot performance after PEFT adaptation, addressing a critical limitation identified in recent evaluations [4] [15]. The zero-shot analysis revealed that both scGPT and Geneformer underperformed compared to traditional methods like Harmony and scVI when used without adaptation, highlighting the essential role of PEFT for practical applications [4].

Technical Implementation: PEFT Architectures for Single-Cell Models

LoRA (Low-Rank Adaptation) Implementation

LoRA operates on the principle that weight updates during adaptation have low intrinsic rank, meaning the change in weights during fine-tuning can be represented by decomposed matrices of lower dimension [31]. For single-cell foundation models, LoRA is typically applied to the attention mechanisms within transformer blocks:

Rank Selection: Empirical studies have found optimal ranks between 4-16 for single-cell tasks, balancing efficiency and expressivity
Matrix Decomposition: The weight update ΔW is represented as BA, where A ∈ R^(r×k) and B ∈ R^(d×r) with rank r ≪ min(d,k)
Adaptation Formulation: The modified forward pass becomes h = Wx + BAx, where W remains frozen and only A and B are updated

For scGPT, LoRA modules are typically integrated into the query and value projections of the attention mechanism, while Geneformer implementations often target the key-value projections based on architectural differences [31].

Prefix Prompt Tuning for Single-Cell Data

Prefix prompt tuning extends the input sequence with trainable tokens that condition the model's behavior for specific tasks [31]. In the context of single-cell data:

Token Construction: Learnable prompt tokens are prepended to the gene sequence, typically comprising 10-20 tokens depending on task complexity
Architecture Adaptation: The attention mechanism is modified to include attention between all tokens while maintaining causal masking for the original sequence
Biological Interpretation: These learned prompts can be viewed as inducing specific cellular contexts that shift model behavior toward task-relevant patterns

PEFT Architecture Diagram: Illustrating the integration of trainable PEFT components with frozen base model parameters.

Research Reagent Solutions: Computational Tools for PEFT Implementation

Table 3: Essential Research Reagents for PEFT Implementation in Single-Cell Analysis

Resource	Type	Function	Implementation Example
scGPT Codebase	Software Framework	Provides base model architecture and PEFT integration points	`scGPT.trainer.PeftTrainer`
Geneformer HuggingFace	Model Repository	Pre-trained weights and basic fine-tuning utilities	`from transformers import GeneformerModel`
LoRA Libraries	Algorithm Implementation	Modular LoRA components for transformer architectures	`peft.LoraConfig`, `get_peft_model`
Single-Cell Benchmarks	Evaluation Datasets	Standardized datasets for comparing PEFT performance	Pancreas, Tabula Sapiens, PBMC datasets
GPU Acceleration	Hardware Infrastructure	Enables efficient training and inference	NVIDIA A100/A6000 with 40-80GB VRAM

Discussion: Implications for Research and Drug Development

The integration of PEFT methods with single-cell foundation models represents a significant advancement for biomedical research and therapeutic development. By dramatically reducing the computational barrier to adapting these powerful models, PEFT enables:

Broader Accessibility: Researchers with limited computational resources can now leverage state-of-the-art foundation models for their specific applications
Rapid Prototyping: Multiple task-specific adaptations can be developed and evaluated simultaneously due to reduced training times and resource requirements
Knowledge Preservation: The core biological understanding encoded during pretraining remains intact, maintaining model robustness across diverse applications

For drug development professionals, these efficiency gains translate to accelerated target discovery and validation. The case of scKAN demonstrates how efficient adaptation of foundation models can identify cell-type-specific therapeutic targets and even suggest drug repurposing candidates [32]. Similarly, Large Perturbation Models (LPMs) show how adapted foundation models can predict compound effects and identify mechanisms of action [12].

The comparative analysis between scGPT and Geneformer reveals that while both benefit substantially from PEFT approaches, scGPT generally demonstrates stronger adaptation capabilities across diverse tasks [31] [4]. This performance advantage, combined with its architectural flexibility, positions scGPT as the more versatile foundation for PEFT implementations in single-cell analysis.

Parameter-Efficient Fine-Tuning establishes a pragmatic pathway for maximizing the utility of single-cell foundation models while minimizing computational costs. The experimental evidence demonstrates that PEFT can achieve comparable performance to full fine-tuning while reducing parameter updates by approximately 90% [31]. This efficiency breakthrough addresses critical limitations in current single-cell foundation models, particularly their poor zero-shot performance and computational intensiveness [4] [15].

As the field progresses, several emerging trends will shape future developments in efficient adaptation methods. Multi-modal PEFT approaches that integrate transcriptomic, epigenomic, and spatial data within unified foundation models represent a promising direction [30] [33]. Additionally, automated PEFT configuration methods that dynamically optimize adapter architecture and rank selection for specific tasks and datasets could further enhance efficiency and performance.

For researchers and drug development professionals, the strategic adoption of PEPT methods enables more flexible and scalable deployment of single-cell foundation models across diverse applications. By balancing performance with efficiency, these approaches ensure that the transformative potential of single-cell AI can be realized across the broader research community, accelerating biological discovery and therapeutic development.

Impact of Pretraining Data Scale and Diversity on Model Generalization

The advent of single-cell foundation models represents a transformative shift in computational biology, offering the potential to decode cellular heterogeneity with unprecedented precision. Among these, scGPT and Geneformer have emerged as prominent frameworks based on the transformer architecture, trained on millions of single-cell transcriptomes to learn fundamental biological principles [34]. A critical factor influencing their performance is the scale and diversity of pretraining data, which theoretically enables models to capture universal patterns applicable to diverse downstream tasks [4] [1]. This guide objectively compares how differences in pretraining strategies between these models impact their generalization capabilities across key biological applications, providing researchers and drug development professionals with evidence-based insights for model selection.

Pretraining Architectures and Data Fundamentals

Model Architectures and Pretraining Objectives

While both models utilize transformer architectures, they diverge significantly in their input representation and pretraining objectives, which directly influences how they leverage pretraining data.

Geneformer employs a rank value encoding approach where genes are ranked by their expression in each cell and scaled by their expression across the entire pretraining corpus [35]. This nonparametric representation prioritizes genes that distinguish cell state while deprioritizing ubiquitously highly-expressed housekeeping genes [35]. The model is pretrained using a masked learning objective where 15% of genes in each transcriptome are masked, and the model predicts which gene should occupy each masked position [35].

scGPT utilizes a value binning strategy that segments continuous gene expression values into discrete buckets, transforming expression prediction into a classification problem [14]. The model employs an attention mask mechanism for autoregressive prediction and optimizes both cell and gene representations through self-supervised learning [14]. scGPT's pretraining incorporates both gene-prompt and cell-prompt tasks using iterative masked modeling with MSE loss [1].

Pretraining Data Composition

The scale and composition of pretraining data significantly differs between these models, affecting their biological understanding and generalization potential.

Table 1: Pretraining Data Composition Comparison

Model	Pretraining Scale	Data Composition	Species Focus	Key Features
Geneformer	~104 million human single-cell transcriptomes (V2) [35]	Non-cancerous human cells across diverse tissues [35]	Human	Excludes high mutational burden cells; rank-based encoding prioritizes distinguishing genes
scGPT	~33 million human cells [1] [14]	Diverse human cell types covering cellular heterogeneity [34]	Human	Multimodal capacity (scRNA-seq, scATAC-seq, CITE-seq) [1]

Experimental Evaluation of Model Generalization

Evaluation Methodology and Metrics

Rigorous benchmarking studies have employed standardized evaluation protocols to assess model performance across diverse tasks. The key experiments cited herein utilize zero-shot evaluation where models are applied without task-specific fine-tuning, providing insights into their inherent biological understanding gained during pretraining [4] [15].

Cell Type Clustering Evaluation:

Objective: Assess how well model embeddings separate known cell types without supervision
Metrics: Average BIO (AvgBIO) score and Average Silhouette Width (ASW) [4]
Baselines: Highly variable genes (HVG), Harmony, scVI [4]

Batch Integration Assessment:

Objective: Evaluate ability to correct for technical variations while preserving biological signals
Metrics: Batch integration scores, Principal Component Regression (PCR) [4]
Datasets: Pancreas, PBMC, Tabula Sapiens, Immune datasets [4]

Biological Insight Analysis:

Objective: Quantify capture of meaningful biological relationships
Metrics: scGraph-OntoRWR (consistency with biological knowledge), Lowest Common Ancestor Distance (LCAD) [1]
Methods: Attention weight analysis, gene network inference [1]

Quantitative Performance Comparison

Table 2: Zero-Shot Performance Comparison Across Tasks

Task	Dataset	scGPT Performance	Geneformer Performance	Top Performing Method
Cell Type Clustering	Pancreas	Underperformed scVI and Harmony [4]	Underperformed HVG across all metrics [4]	HVG [4]
Cell Type Clustering	PBMC (12k)	Outperformed scVI and Harmony [4]	Underperformed baselines [4]	scGPT [4]
Batch Integration	Pancreas	Partial batch effect correction [4]	Poor performance, structure driven by batch effects [4]	Harmony and scVI [4]
Batch Integration	Tabula Sapiens	Outperformed Harmony and scVI [4]	Consistently ranked last across metrics [4]	scGPT [4]
Gene Function Prediction	Multiple	Moderate performance [14]	Context-specific strengths [35]	CellFM (newer model) [14]

Impact of Pretraining Data Scale

Studies specifically manipulating pretraining data scale reveal nuanced relationships between data volume and model performance.

scGPT Variants Analysis: Research evaluated four scGPT variants: randomly initialized, pretrained on 814,000 kidney cells (scGPT-kidney), on 10.3 million blood and bone marrow cells (scGPT-blood), and on 33 million non-cancerous human cells (scGPT-human) [4]. Findings indicate that:

Pretraining consistently improved cell-type clustering on PBMC dataset compared to random initialization
scGPT-blood showed improved performance over scGPT-kidney on immune and Tabula Sapiens datasets
Surprisingly, scGPT-human slightly underperformed scGPT-blood even for non-blood tissue types [4]

Geneformer Scaling: Geneformer has scaled from its initial version (V1) trained on ~30 million transcriptomes to an updated version (V2) trained on ~104 million human single-cell transcriptomes [35]. The expanded pretraining corpus aims to enhance the model's fundamental understanding of network dynamics, though rigorous zero-shot evaluation of this latest version is still emerging.

Specialized vs. Generalized Pretraining Strategies

Tissue-Specific Pretraining Insights

The emergence of organ-specific foundation models offers insights into the specialization versus generalization debate in pretraining strategies.

Nephrobase Cell+, a kidney-specific foundation model pretrained on ~39.5 million single-cell and single-nucleus profiles across four mammalian species, demonstrates how targeted pretraining can outperform generalized models on tissue-specific tasks [36]. In kidney-relevant evaluations:

Nephrobase Cell+ achieved KMeans ARI of 0.82 versus 0.40 for scGPT and 0.22 for Geneformer
Produced superior cluster separation of proximal tubule and thick ascending limb cells
Achieved perfect batch mixing (cLISI = 1.00) while preserving biological signals [36]

Multimodal and Cross-Species Generalization

Models vary in their ability to generalize across experimental modalities and species, reflecting the breadth of their pretraining data.

scGPT demonstrates capabilities in integrating multi-omics data, including joint analysis of gene expression and chromatin accessibility (Multiome PBMC) and paired gene expression with protein abundance (BMMCs) [34]. The model's attention maps have been shown to capture gene network patterns, enabling biological discovery [34].

Geneformer exhibits strengths in network biology applications, showing remarkable capability in predicting dosage-sensitive disease genes and identifying candidate therapeutic targets [34] [35]. Its in silico perturbation analyses have successfully identified novel transcription factors critical to cardiomyocyte function, with experimental validation [35].

Practical Applications and Limitations

Performance in Discovery Settings

The zero-shot capabilities of foundation models are particularly crucial for discovery settings where labels are unknown or novel biological phenomena are being explored [4]. Both scGPT and Geneformer face reliability challenges in these contexts:

In clustering novel cell types, both models were outperformed by simpler methods like highly variable genes selection across most datasets [4] [15]
For batch integration, Geneformer's embeddings often showed higher proportion of variance explained by batch effects compared to original data [4]
scGPT demonstrated limited ability in gene expression prediction, often predicting median expression values regardless of true expression [15]

Interpretation of Model Generalization

The relationship between pretraining data characteristics and downstream performance appears complex and nonlinear:

Data diversity versus specificity: While increasing pretraining data diversity generally improves performance, beyond certain limits larger datasets may not confer additional benefits [4]
Task-model alignment: Performance depends on alignment between pretraining objectives and downstream tasks [1]
Architecture constraints: Model performance may be limited by architectural choices rather than pretraining data alone [4]

Factors Influencing Foundation Model Generalization

Research Reagent Solutions

Table 3: Essential Research Tools for Foundation Model Evaluation

Resource Category	Specific Tools	Function in Evaluation	Key Features
Benchmark Datasets	Tabula Sapiens, Pancreas, PBMC, Immune Datasets [4]	Standardized evaluation across tissues and technologies	Diverse biological contexts, multiple batches
Evaluation Metrics	AvgBIO, ASW, scGraph-OntoRWR, LCAD [4] [1]	Quantify biological relevance of embeddings	Connect computational outputs to biological knowledge
Baseline Methods	HVG, Harmony, scVI [4]	Performance comparison benchmarks	Establish minimum performance thresholds
Model Architectures	Geneformer (rank-based), scGPT (value binning) [1] [35]	Fundamental approach comparison	Different input representations and objectives

The generalization capabilities of single-cell foundation models demonstrate complex relationships with pretraining data scale and diversity. Current evidence suggests that while both scGPT and Geneformer benefit from large-scale pretraining, their performance gains are neither uniform nor predictable across tasks [4] [1]. scGPT shows advantages in multimodal integration and certain batch correction scenarios, particularly on datasets included in its pretraining corpus [4]. Geneformer exhibits strengths in network biology applications and in silico perturbation predictions [35]. Neither model consistently outperforms simpler baseline methods in zero-shot settings, indicating that biological insight does not automatically emerge from scale alone [4] [15]. For researchers and drug development professionals, model selection should be guided by specific task requirements, available computational resources, and the alignment between pretraining data composition and target applications. Future advancements may emerge from more strategic pretraining approaches that prioritize data quality and biological relevance over sheer volume.

Evidence-Based Benchmarking: Rigorous Multi-Task Model Validation

Synthesizing Evidence from Independent Benchmarking Studies

The emergence of single-cell RNA sequencing (scRNA-seq) has revolutionized biological research by allowing for the interrogation of transcriptomics at the single-cell level. The analysis of the vast and complex datasets generated by this technology, characterized by high sparsity, high dimensionality, and low signal-to-noise ratio, presents significant computational challenges [1]. In response, single-cell foundation models (scFMs), such as scGPT and Geneformer, have been developed. These models, often based on transformer architectures, are pretrained on millions of cells with the goal of learning universal biological patterns that can be efficiently adapted to various downstream tasks [30]. This guide synthesizes evidence from recent, independent benchmarking studies to provide an objective comparison of the performance of scGPT and Geneformer, offering researchers and drug development professionals a clear, data-driven perspective for model selection.

Performance Comparison Across Key Biological Tasks

Independent benchmarks reveal that the performance of scGPT and Geneformer is highly task-dependent, with neither model consistently outperforming the other across all scenarios. The following tables summarize quantitative results from comprehensive evaluations.

Cell-Level Task Performance

Benchmarks on core cell-level tasks like cell type annotation and batch integration show a varied performance landscape, where simpler methods often remain competitive.

Table 1: Performance Comparison on Cell Type Annotation (Clustering)

Model	Performance Summary	Key Comparative Findings
scGPT	Variable performance across datasets [4].	- Outperforms Geneformer on PBMC (12k) dataset [4].- Comparable to scVI on Tabula Sapiens, Pancreas, and PBMC (12k) datasets [4].- Generally outperformed by HVG selection and established methods like Harmony and scVI across most datasets [4].
Geneformer	Consistently underperforms relative to baselines in zero-shot cell type clustering [4].	- Outperformed by HVG selection across all metrics [4].- Shows poorer separation of known cell types compared to scGPT and baseline methods [4].
Baselines (HVG, scVI, Harmony)	Robust and often superior performance in clustering known cell types [1] [4].	- HVG selection consistently outperforms both scGPT and Geneformer [4].- scVI and Harmony are strong performers, frequently outperforming the foundation models [4].

Table 2: Performance Comparison on Batch Integration

Model	Performance Summary	Key Comparative Findings
scGPT	Effective at integrating datasets with combined technical and biological batch effects [4].	- Can outperform Harmony and scVI on complex datasets like Tabula Sapiens and Immune (which were in its pretraining data) [4].- Struggles to correct for batch effects between different experimental techniques [4].
Geneformer	Limited effectiveness in batch integration [4].	- Cell embeddings often fail to retain biological information and are primarily driven by batch effects [4].- Consistently ranks at the bottom in quantitative batch integration metrics [4].
Baselines (Harmony, scVI)	Generally strong at correcting for technical batch effects [4].	- scVI and Harmony outperform scGPT on datasets with primarily technical variation (e.g., Pancreas, PBMC) [4].

Gene-Level and Perturbation Task Performance

Predicting the effects of genetic perturbations is a challenging task where foundation models have yet to demonstrate a clear advantage over simple models.

Table 3: Performance on Perturbation Effect Prediction

Model / Baseline	Performance on Double Perturbation Prediction	Performance on Unseen Single Perturbation Prediction
scGPT	Prediction error substantially higher than the additive baseline [24].	Unable to consistently outperform the simple "mean prediction" baseline or linear models [24].
Geneformer	Prediction error substantially higher than the additive baseline [24].	Not the primary focus of this benchmark [24].
scFoundation	Prediction error substantially higher than the additive baseline [24].	Could not be robustly evaluated on standard benchmarks due to gene set requirements [24].
Additive Baseline	Best performance; predicts the sum of individual logarithmic fold changes [24].	Not applicable by definition.
"No Change" Baseline	Outperformed by the additive model but competitive with deep learning models [24].	Not applicable by definition.
Linear Model with Pretrained P	Not applicable.	Best performance; uses perturbation embeddings pretrained on other perturbation data [24].

A notable finding is that while the embeddings from scGPT and scFoundation can be repurposed, a simple linear model equipped with these pretrained gene embeddings did not consistently outperform a linear model using embeddings derived from the training data itself [24]. This suggests that the benefit of large-scale atlas pretraining for this specific task may be limited compared to pretraining on perturbation data directly [24].

Detailed Experimental Protocols from Benchmarking Studies

To ensure reproducibility and provide context for the data, this section outlines the key methodologies employed in the major benchmarking studies cited.

Protocol: Large-Scale scFM Benchmark (2025)

This study presented a comprehensive benchmark of six scFMs, including scGPT and Geneformer, against established baselines [1].

Objective: To evaluate the zero-shot performance of scFMs on biologically and clinically relevant tasks and identify their strengths and limitations [1].
Models Evaluated: Geneformer, scGPT, UCE, scFoundation, LangCell, scCello, and baseline methods (HVG selection, Seurat, Harmony, scVI) [1].
Downstream Tasks:
- Gene-Level: Gene network inference and gene functionality prediction [1].
- Cell-Level: Batch integration, cell type annotation, cancer cell identification, and drug sensitivity prediction [1].
Evaluation Metrics: 12 metrics spanning unsupervised, supervised, and novel knowledge-based approaches, including a new metric, scGraph-OntoRWR, to measure consistency of model-derived cell type relationships with established biological knowledge [1].
Key Workflow Steps:
- Feature Extraction: Zero-shot gene and cell embeddings were obtained from each pretrained scFM without further fine-tuning [1].
- Task Evaluation: These embeddings were directly used as input for the various downstream tasks [1].
- Holistic Ranking: Model performance was aggregated using a non-dominated sorting algorithm to provide task-specific and general rankings [1].

Protocol: Zero-Shot Evaluation of scGPT and Geneformer (2025)

This study focused specifically on evaluating the zero-shot capabilities of scGPT and Geneformer, a critical setting for exploratory biology where labels are unknown [4].

Objective: To assess the robustness and reliability of scGPT and Geneformer cell embeddings without any task-specific fine-tuning [4].
Tasks:
- Cell Type Clustering: Evaluating the separation of known cell types in the embedding space using metrics like Average BIO (AvgBio) score and Average Silhouette Width (ASW) [4].
- Batch Integration: Assessing the ability to mix cells from different batches while preserving biological variation, evaluated both qualitatively via visualization and quantitatively with batch integration metrics [4].
Datasets: Multiple public datasets, including Tabula Sapiens, Pancreas, and PBMC (12k), with consideration for potential overlap with model pretraining data [4].
Baselines: Performance was compared against simple baselines like Highly Variable Genes (HVG) and established integration methods (Harmony, scVI) [4].

Protocol: Perturbation Prediction Benchmark (2025)

This benchmark critically assessed the performance of foundation models and other deep learning methods on predicting transcriptomic changes after genetic perturbations [24].

Objective: To compare the prediction accuracy of deep learning models against deliberately simple baseline models for genetic perturbation effects [24].
Data: CRISPR-based perturbation datasets (e.g., Norman et al., Replogle et al.) in human cell lines (K562, RPE1) [24].
Tasks:
- Double Perturbation Prediction: Models were fine-tuned on single and some double perturbations and tested on held-out double perturbations. Prediction error was measured as the L2 distance between predicted and observed gene expression [24].
- Unseen Single Perturbation Prediction: Models were tasked with predicting outcomes of perturbing genes not seen during training [24].
Baselines:
- "No Change": Always predicts the control condition's expression [24].
- "Additive": Predicts the sum of individual logarithmic fold changes for a double perturbation [24].
- Linear Model: A simple linear model using low-dimensional representations of genes and perturbations [24].
- "Mean": Always predicts the average expression across training perturbations [24].

Visualizing Model Architectures and Benchmarking Workflows

scFM Input Tokenization Strategies

Benchmarking scFMs: A Standardized Workflow

Table 4: Essential Resources for scFM Research and Application

Resource Name	Type	Primary Function in scFM Research
CELLxGENE (CZ CELLxGENE)	Data Platform	Provides unified access to millions of curated, annotated single-cell datasets, serving as a primary source for model pretraining and benchmarking [4] [30].
scib-metrics	Software Library	Provides standardized implementations of metrics for benchmarking batch integration and bio-conservation in single-cell data [21].
BioLLM (Biological Large Language Model)	Software Framework	A unified framework that integrates diverse scFMs with standardized APIs, simplifying model access, switching, and consistent benchmarking [17].
Gene Ontology (GO)	Knowledge Base	A structured, controlled vocabulary of gene functions. Used by some models (e.g., GEARS) to inform gene relationships and for functional analysis of results [24].
CellxGene Census	Data & Model Repository	Provides access to both single-cell data and pretrained model embeddings (e.g., for scVI, Geneformer, scGPT, UCE), facilitating direct comparison and application [21].

Synthesizing evidence from independent benchmarks leads to a central, nuanced conclusion: there is no single "best" model between scGPT and Geneformer. Their performance is highly contingent on the specific task, dataset characteristics, and whether they are used in a zero-shot or fine-tuned setting [1] [4].

For cell-level tasks like batch integration involving complex biological and technical variation, scGPT generally demonstrates more robust zero-shot performance than Geneformer [4]. However, simpler, established methods like Harmony and scVI remain formidable competitors, especially for technical batch correction [4].
For gene-level and perturbation prediction, both scGPT and Geneformer, along with other specialized deep learning models, have so far failed to outperform deliberately simple linear or additive baselines [24]. This indicates that the goal of leveraging foundational pretraining for accurate, generalizable prediction of perturbation outcomes remains elusive.

Therefore, researchers and drug development professionals are advised to base their model selection on the specific requirements of their project. For exploratory analysis with unknown cell types where zero-shot application is necessary, scGPT may be a more reliable choice, though one should be aware of the limitations. For tasks with sufficient labeled data for fine-tuning, or for perturbation prediction, investing computational resources in a foundation model may not yet provide an advantage over simpler, more efficient alternatives.

In the rapidly evolving field of single-cell biology, foundation models like scGPT and Geneformer promise to revolutionize how researchers analyze cellular heterogeneity and gene regulatory networks. While both models are transformer-based architectures pretrained on millions of single-cell transcriptomes, they exhibit distinct strengths and limitations across different biological tasks. Understanding these task-specific performance characteristics is essential for researchers, scientists, and drug development professionals seeking to implement these tools effectively. This guide provides an objective comparison of scGPT's robustness across diverse applications versus Geneformer's specialized capabilities in gene-level insights, supported by recent experimental data and benchmarking studies.

Performance Comparison: Quantitative Benchmarks

Recent comprehensive evaluations reveal that neither scGPT nor Geneformer consistently outperforms the other across all tasks. Instead, each model demonstrates distinct strengths depending on the application context and evaluation metrics.

Table 1: Performance Comparison Across Key Biological Tasks

Task Category	Specific Task	scGPT Performance	Geneformer Performance	Key Benchmarking Study
Cell-level Tasks	Zero-shot cell type clustering	Inconsistent; outperforms baselines on some datasets (e.g., PBMC 12k) but underperforms on others [4]	Generally underperforms simpler methods (HVG, scVI, Harmony) across multiple datasets [4]	Kedzierska et al., 2025 [4]
	Batch integration	Effective on complex datasets with biological and technical variation; outperforms Harmony/scVI on Immune & Tabula Sapiens datasets [4]	Poor performance; embeddings often dominated by batch effects rather than biology [4]	Kedzierska et al., 2025 [4]
	Cell type annotation (fine-tuned)	Robust performance across all tasks [17]	Moderate performance	BioLLM Framework Evaluation [17]
Gene-level Tasks	Gene network inference	Moderate performance	Strong capabilities benefiting from effective pretraining strategies [17]	BioLLM Framework Evaluation [17]
	Gene function prediction	Not specified	Strong performance for identifying gene function [14]	CellFM Benchmarking [14]
Overall Versatility	Multiple tasks spanning gene and cell levels	Robust performance across all tasks [17]	Specialized strength in gene-level tasks [17]	BioLLM Framework Evaluation [17]

Experimental Protocols and Methodologies

To ensure reproducible results, understanding the experimental design behind these performance benchmarks is crucial. The following section outlines the key methodologies employed in evaluating scGPT and Geneformer.

Zero-shot Evaluation Protocol

The zero-shot evaluation paradigm is critical for assessing the fundamental biological understanding that models acquire during pretraining, without task-specific fine-tuning.

Purpose: To determine if pretraining develops transferrable biological knowledge applicable to discovery settings where labels are unknown [4]
Datasets: Multiple datasets including PBMC (12k), Tabula Sapiens, Pancreas, and Immune datasets with known cell type annotations [4]
Evaluation Metrics: Average BIO (AvgBio) score for cell type clustering, average silhouette width (ASW), and batch integration metrics [4]
Baseline Comparisons: Performance compared against established methods including Highly Variable Genes (HVG) selection, Harmony, and scVI [4]
Embedding Analysis: Visualization and quantitative assessment of model embeddings for cell type separation and batch effect correction [4]

Gene Network Inference Methodology

Gene network inference evaluates a model's ability to reconstruct biologically meaningful relationships between genes, reflecting its understanding of regulatory mechanisms.

Objective: To infer gene-gene interactions and regulatory relationships from single-cell transcriptomic data [37]
Benchmarking: Performance assessed against literature-curated networks and cell type-specific networks validated via orthogonal sequencing methods [37]
Interpretability Analysis: Attention mechanisms within transformer architectures are analyzed to identify which gene relationships the model prioritizes [1]
Validation: Comparison with established gene regulatory networks and functional enrichment analysis of identified connections [37]

Integrated Benchmarking Framework

Comprehensive frameworks like BioLLM provide standardized evaluation across diverse tasks to ensure fair model comparison.

Unified Interface: Standardized APIs that eliminate architectural and coding inconsistencies across different scFMs [17]
Task Diversity: Evaluation across both gene-level and cell-level tasks to assess generalizability [17]
Evaluation Modes: Support for both zero-shot and fine-tuning paradigms to understand how models perform under different application scenarios [17]
Knowledge-Driven Metrics: Incorporation of biological prior knowledge through metrics like scGraph-OntoRWR, which measures consistency of captured cell type relationships with established biological ontologies [1]

Model Architectures and Biological Insights

The differential performance of scGPT and Geneformer stems from their distinct architectural choices and pretraining strategies, which shape how they process and interpret single-cell data.

Architectural Differences and Tokenization Strategies

Table 2: Model Architectures and Pretraining Approaches

Feature	scGPT	Geneformer
Model Parameters	50 million [1]	40 million [1]
Pretraining Dataset Size	33 million human cells [1]	30 million single-cell transcriptomes [14]
Input Gene Selection	1200 Highly Variable Genes (HVGs) [1]	2048 ranked genes [1]
Value Representation	Value binning [1]	Gene ordering by expression level [1]
Positional Embedding	Not used [1]	Used [1]
Architecture Type	Transformer encoder with attention mask [1]	Transformer encoder [1]
Primary Pretraining Task	Iterative masked gene modeling with MSE loss [1]	Masked gene modeling with categorical gene ID prediction [1]

Biological Pathway and Workflow Diagram

The following diagram illustrates the relationship between model architectures and their resulting biological insights, highlighting how different pretraining objectives shape task-specific strengths:

The Scientist's Toolkit: Essential Research Reagents

Implementing scGPT or Geneformer in research workflows requires both computational resources and biological data of sufficient quality. The following table outlines key components of the experimental "toolkit" needed for effective model application.

Table 3: Essential Research Reagents and Resources

Resource Category	Specific Resource	Function in Evaluation	Relevance to Model Performance
Reference Datasets	CELLxGENE [4]	Provides standardized, annotated single-cell data for pretraining and evaluation	Critical for model pretraining; dataset diversity impacts generalizability
	Tabula Sapiens [4]	Multi-tissue atlas for evaluating cross-tissue performance	Tests model ability to handle biological complexity
	PBMC 12k [4]	Well-characterized immune cell dataset	Benchmark for immune cell profiling and batch integration
Computational Tools	Harmony [4]	Batch effect correction algorithm	Baseline comparison for integration tasks
	scVI [4]	Probabilistic generative model for single-cell data	Baseline for clustering and representation learning
	HVG Selection [4]	Feature selection method using highly variable genes	Simple baseline for evaluating embedding quality
Evaluation Metrics	AvgBIO Score [4]	Measures cell type clustering performance	Quantifies biological relevance of embeddings
	ASW (Average Silhouette Width) [4]	Evaluates clustering compactness and separation	Complementary metric for clustering quality
	scGraph-OntoRWR [1]	Novel metric measuring consistency with biological ontologies	Evaluates biological plausibility of learned relationships
Hardware Resources	GPU (e.g., A100 [38])	Accelerates model training and inference	Enables fine-tuning and large-scale application
	Ascend910 NPUs [14]	Specialized AI training chips	Used for training large models like CellFM

Practical Implementation Guidelines

Based on the performance characteristics and experimental results, researchers can follow these evidence-based recommendations for implementing scGPT and Geneformer in different scenarios.

Model Selection Decision Framework

Choose scGPT when: Your priority is robust performance across multiple cell-level tasks, particularly batch integration of complex datasets containing both technical and biological variation [4] [17]
Choose Geneformer when: Your research questions focus on gene-level insights, including gene network inference, gene function prediction, and understanding regulatory relationships [14] [17]
Consider alternative methods when: Working in zero-shot discovery settings with limited computational resources, as simpler methods like HVG selection or Harmony may outperform foundation models for specific tasks [4]

Optimization Strategies

For scGPT: Leverage its strong zero-shot batch integration capabilities when analyzing datasets with multiple experimental conditions or technologies [4]
For Geneformer: Utilize its attention mechanisms to extract interpretable gene-gene relationships for hypothesis generation about regulatory mechanisms [1]
General practice: Always compare foundation model performance against simpler baseline methods specific to your task, as traditional algorithms may provide sufficient performance with lower computational costs [4] [15]

scGPT and Geneformer represent significant advances in single-cell computational biology, but their distinct architectural choices and training objectives lead to specialized strengths. scGPT demonstrates more robust performance across diverse cell-level tasks, particularly in batch integration scenarios involving complex biological and technical variations. In contrast, Geneformer excels at gene-level insights, showing stronger capabilities in gene network inference and function prediction. Researchers should select between these models based on their specific biological questions, prioritizing scGPT for atlas-level integration tasks and Geneformer for investigating gene regulatory mechanisms. As both models continue to evolve, ongoing benchmarking against traditional methods remains essential to ensure biological insights derive from meaningful computational advances rather than architectural complexity alone.

This guide provides a quantitative comparison of two prominent single-cell foundation models (scFMs), scGPT and Geneformer, focusing on their performance in bio-conservation, batch correction, and predictive accuracy. The evaluation is based on recent benchmark studies to inform researchers and drug development professionals in their model selection process.

Table 1: Cell-level Task Performance Comparison

This table summarizes model performance on key cell-level tasks, including cell type annotation (bio-conservation) and batch integration, as measured by established metrics. A higher score is better for all metrics. [1] [4]

Task	Metric	scGPT	Geneformer	Top Performing Baseline
Cell Type Annotation(AvgBIO Score)	AvgBIO (Pancreas)	~0.45	~0.35	HVG (~0.65)
	AvgBIO (Immune)	~0.55	~0.40	Harmony (~0.70)
	ASW (Tabula Sapiens)	~0.75	~0.65	scGPT (~0.75)
Batch Integration(Batch Mixing Score)	iLISI (Pancreas)	~0.60	~0.40	HVG (~0.85)
	iLISI (PBMC)	~0.70	~0.45	scVI (~0.80)
	PCR (Immune)	~0.30	~0.15	Harmony (~0.35)

Table 2: Gene-level and Perturbation Task Performance

This table compares performance on gene function prediction and perturbation response modeling, which are critical for predictive accuracy and therapeutic discovery. [1] [14] [12]

Task	Metric	scGPT	Geneformer	Notes
Gene Function Prediction	AUC (GO Term Prediction)	0.72	0.75	Geneformer benefits from effective pretraining on gene relationships [17].
Perturbation Outcome Prediction	Pearson r (Transcriptome)	0.25 (with fine-tuning)	0.28 (with fine-tuning)	The Large Perturbation Model (LPM) significantly outperformed both (r > 0.45) [12].
Zero-shot Gene Expression Prediction	Correlation	Poor (predicts median value)	Not Evaluated	scGPT showed limited ability without conditioning on cell embeddings [4] [15].

Experimental Protocols for Key Evaluations

Evaluation of Bio-conservation and Batch Correction

Objective: To assess the models' ability to generate cell embeddings that preserve biological cell types (bio-conservation) while removing non-biological technical variations (batch correction) [1] [4].

Workflow:

Input Data: Processed single-cell RNA-seq datasets (e.g., Pancreas, PBMC, Tabula Sapiens) containing cells with annotated cell types and batch/sample labels [4].
Feature Extraction: Generate cell embeddings in a zero-shot setting using the pre-trained scGPT and Geneformer models without any task-specific fine-tuning [4] [15].
Baseline Methods: Compare against:
- HVG (Highly Variable Genes): Using the top 2,000 most variable genes.
- Harmony: An integration algorithm that iteratively corrects PCA embeddings.
- scVI: A probabilistic generative model for single-cell data [4].
Quantitative Metrics:
- Bio-conservation: Average BIO (AvgBIO) score and Average Silhouette Width (ASW) computed on cell type labels. Higher values indicate better separation of cell types.
- Batch Correction: Integration Local Inverse Simpson's Index (iLISI) and Principal Component Regression (PCR) score computed on batch labels. Higher iLISI and lower PCR indicate better batch mixing [1] [4].

Zero-Shot Evaluation Workflow for Cell Embeddings

Evaluation of Predictive Accuracy in Perturbation Tasks

Objective: To benchmark the models' accuracy in predicting gene expression changes in response to genetic or chemical perturbations [12].

Workflow:

Input Data: Perturbation experiment datasets (e.g., from LINCS) containing pre- and post-perturbation transcriptomes.
Model Fine-tuning: Fine-tune scGPT and Geneformer on a subset of perturbation data according to their respective prescribed methodologies [12].
Baseline Methods: Compare against:
- CPA (Compositional Perturbation Autoencoder): Predicts effects of unseen perturbation combinations.
- GEARS: Predicts effects of unseen genetic perturbations.
- LPM (Large Perturbation Model): A decoder-only model integrating diverse perturbation data [12].
Quantitative Metric: The Pearson correlation coefficient between the model's predicted gene expression and the held-out experimentally observed gene expression post-perturbation [12].

Workflow for Perturbation Prediction Benchmarking

Table 3: Essential Materials for scFM Evaluation

This table lists the key datasets, metrics, and computational tools used in the benchmark studies, providing a practical resource for replicating or extending this research. [1] [4] [14]

Category	Item	Function in Evaluation
Benchmark Datasets	Human Cell Atlas (e.g., from CellxGene)	Provides large-scale, diverse human scRNA-seq data for pre-training and benchmarking [1] [14].
	Pancreas Dataset	A standard benchmark with multiple batches and techniques for evaluating batch correction [4].
	Perturbation Datasets (e.g., LINCS)	Contains genetic and chemical perturbation data for testing predictive accuracy [12].
Evaluation Metrics	AvgBIO / ASW	Quantifies how well an embedding preserves biological cell type identity (bio-conservation) [1] [4].
	iLISI / PCR	Quantifies how well technical batch effects have been removed (batch correction) [1] [4].
	Pearson Correlation	Measures accuracy in predicting continuous outcomes, such as gene expression after perturbation [12].
Software & Models	BioLLM Framework	A unified framework that provides standardized APIs for integrating and evaluating different scFMs, streamlining model comparison [17].
	Harmony / scVI	Established baseline methods for data integration against which new foundation models are compared [1] [4].
	HVG Selection	A simple, yet strong baseline for feature selection that often competes with or outperforms complex foundation models in zero-shot tasks [4] [15].

In the rapidly evolving field of single-cell transcriptomics, foundation models like scGPT and Geneformer promise to revolutionize biological discovery. However, comprehensive benchmarking reveals a critical insight: no single model consistently outperforms all others across diverse tasks. Performance is highly dependent on the specific application, with scGPT generally demonstrating stronger all-around capabilities, particularly in cell-level tasks, while Geneformer shows strengths in certain gene-level analyses. This guide provides an objective comparison of their performance, supported by experimental data, to inform researchers and drug development professionals in selecting the appropriate tool for their specific needs.

Single-cell RNA sequencing (scRNA-seq) provides unprecedented insights into cellular heterogeneity, yet analyzing this data presents challenges due to its high dimensionality, sparsity, and technical noise. Single-cell foundation models (scFMs), pre-trained on millions of cells, aim to learn universal biological representations that can be adapted to various downstream tasks. Two prominent models, scGPT and Geneformer, have emerged with different architectural approaches and training methodologies.

scGPT employs a transformer architecture with a value categorization strategy, binning gene expression values into discrete buckets. Pre-trained on over 33 million human cells, it uses an attention mask mechanism for autoregressive prediction and is designed for diverse tasks including cell-type annotation, batch integration, and gene network inference [14] [1] [38].

Geneformer utilizes a gene-ranking approach, representing cells as sequences of genes ordered by expression levels. Pre-trained on approximately 30 million single-cell transcriptomes, it uses a masked language model objective where the model predicts the identity of masked genes based on context [14] [1].

Both models follow a "pre-train then fine-tune" paradigm, but their zero-shot performance—using pre-trained models without task-specific fine-tuning—is critical for discovery settings where labels are unknown [4].

Performance Benchmarking Across Critical Tasks

Rigorous evaluations of scGPT and Geneformer reveal a task-dependent performance landscape. The following comparative analysis synthesizes findings from multiple benchmarking studies to provide a holistic view of their capabilities.

Zero-Shot Cell Type Clustering

Cell type clustering is a fundamental task in single-cell analysis where models must group cells by biological function rather than technical batch effects. In zero-shot settings, where models are applied without fine-tuning, both scGPT and Geneformer show significant limitations.

Table 1: Zero-Shot Cell Type Clustering Performance (AvgBIO Score)

Model	PBMC (12k)	Tabula Sapiens	Pancreas	Immune Dataset
scGPT	0.65	0.48	0.42	0.45
Geneformer	0.38	0.31	0.35	0.33
scVI	0.63	0.59	0.58	0.62
Harmony	0.61	0.55	0.56	0.58
HVG	0.68	0.62	0.61	0.64

Evaluation across five datasets shows that both foundation models are outperformed by simpler methods like Highly Variable Genes (HVG), scVI, and Harmony. Geneformer particularly struggles, with performance substantially below other methods. scGPT shows more competitive performance on the PBMC dataset but remains inferior to established baselines on others [4].

Batch Integration Performance

Batch integration removes technical variations between datasets while preserving biological signals. This is crucial for combining data from multiple sources.

Table 2: Batch Integration Performance (Batch Mixing Score)

Model	Pancreas	PBMC	Tabula Sapiens	Immune Dataset
scGPT	0.52	0.61	0.72	0.69
Geneformer	0.31	0.35	0.38	0.41
scVI	0.71	0.75	0.65	0.58
Harmony	0.69	0.72	0.55	0.73
HVG	0.76	0.78	0.74	0.75

In batch integration, Geneformer consistently ranks last across all datasets, often increasing batch effects compared to raw data. scGPT shows intermediate performance, outperforming scVI and Harmony on complex datasets with biological batch effects (Tabula Sapiens and Immune) but underperforming on datasets with purely technical variation [4].

Gene Perturbation Effect Prediction

Predicting transcriptional responses to genetic perturbations is a key application for therapeutic development. Surprisingly, foundation models show limited capability in this domain.

Table 3: Perturbation Prediction Performance (L2 Distance for Top 1000 Genes)

Model	Double Perturbation (Norman et al.)	Unseen Single Perturbation (Replogle et al.)
scGPT	6.21	5.89
Geneformer*	6.85	6.92
scFoundation	6.45	N/A
Additive Baseline	5.72	5.65
No Change Baseline	6.15	5.94

Note: Geneformer and other models not designed for perturbation prediction were repurposed with a linear decoder [24].

None of the deep learning models outperformed deliberately simple baselines, including an additive model that sums individual perturbation effects. This indicates current foundation models have limited ability to generalize to perturbation prediction despite significant computational resources required for fine-tuning [24].

A comprehensive 2025 benchmark evaluating six scFMs across two gene-level and four cell-level tasks provides holistic rankings:

Table 4: Overall Model Rankings by Task Type

Model	Cell-Level Tasks	Gene-Level Tasks	Overall Ranking
scGPT	1	2	1
Geneformer	4	1	3
scFoundation	3	3	2
UCE	2	4	4
scBERT	5	5	5

scGPT demonstrates robust performance across all tasks, particularly excelling in cell-level applications. Geneformer shows stronger performance in gene-level tasks, benefiting from its effective pretraining strategy, but lags in cell-level applications [1] [17].

Experimental Protocols for Benchmarking

To ensure reproducible evaluations, benchmarking studies follow standardized protocols across tasks. Below are the methodologies for key experiments cited in this guide.

Zero-Shot Cell Type Clustering Protocol

Objective: Evaluate model-generated cell embeddings' ability to separate known cell types without task-specific training.

Dataset Preparation:

Utilize diverse biological datasets (e.g., Pancreas, PBMC, Tabula Sapiens) with established cell type labels
Apply standard preprocessing: quality control, normalization, and filtering
For foundation models: input data according to their specific requirements (e.g., ranked genes for Geneformer, binned expressions for scGPT)

Embedding Generation:

Extract cell embeddings from pre-trained models without fine-tuning
Generate comparable embeddings from baseline methods (scVI, Harmony, HVG)
For HVG baseline: select 2,000 most highly variable genes as features

Evaluation Metrics:

Calculate Average BIO score (AvgBIO) combining multiple cluster similarity metrics
Compute Average Silhouette Width (ASW) with respect to cell type labels
Perform Leiden clustering on embeddings followed by Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) calculations against ground truth labels

Statistical Analysis:

Repeat evaluations across multiple random seeds
Compare performance across datasets and models using standardized scoring [4] [21]

Batch Integration Assessment Protocol

Objective: Quantify model's ability to remove technical batch effects while preserving biological variation.

Dataset Selection:

Curate datasets with known batch effects (e.g., Pancreas dataset with 5 different experimental sources)
Include both technical replicates (same cell type, different protocols) and biological variations

Integration Workflow:

Generate integrated embeddings using each model's standard pipeline
For baseline methods: apply standard integration procedures (Harmony integration, scVI latent space manipulation)
Ensure all methods output embeddings in comparable dimensionalities

Evaluation Framework:

Batch mixing metrics: Calculate silhouette scores with respect to batch labels (lower indicates better mixing)
Bio-conservation metrics: Compute silhouette scores with respect to cell type labels (higher indicates better biological preservation)
Principal Component Regression (PCR): Quantify proportion of variance explained by batch effects after integration
Visual inspection: UMAP visualization to assess qualitative integration performance

Quantitative Ranking:

Aggregate scores across multiple metrics and datasets
Generate overall rankings for batch correction effectiveness [4] [21]

Perturbation Prediction Methodology

Objective: Assess model capability to predict gene expression changes after genetic perturbations.

Data Sources:

Utilize CRISPR-based perturbation datasets (Norman et al. for double perturbations, Replogle et al. for single perturbations)
Include both seen perturbations (in training) and unseen perturbations (held out for testing)

Experimental Setup:

Double perturbation prediction: Train on all single perturbations and half of double perturbations, test on remaining double perturbations
Unseen perturbation prediction: Train on subset of perturbations, test on completely held-out perturbations
Baselines: Implement simple additive model (sum of individual perturbation effects) and "no change" baseline

Model Adaptation:

For foundation models not designed for perturbation prediction: add linear decoder to map cell embeddings to gene expression space
Fine-tune models following authors' recommended procedures when available

Evaluation Metrics:

L2 distance between predicted and observed expression values
Focus on top 1,000 highly expressed or differentially expressed genes
Assess genetic interaction prediction capability through true-positive/false-discovery analysis [24]

Visualization of Model Evaluation Workflow

The following diagram illustrates the standardized evaluation workflow used in benchmarking studies to ensure fair comparison across models:

Diagram Title: scFM Evaluation Workflow

Performance Relationship Mapping

The following diagram illustrates the complex relationship between model characteristics, task types, and performance outcomes based on benchmarking results:

Diagram Title: Model-Task Performance Relationships

The Scientist's Toolkit: Essential Research Reagents

To implement and evaluate single-cell foundation models effectively, researchers require specific computational tools and resources. The following table details key components of the experimental ecosystem:

Table 5: Essential Research Reagents for scFM Evaluation

Resource Category	Specific Tools	Function & Purpose
Benchmarking Datasets	Norman et al. perturbation data, Tabula Sapiens, Pancreas datasets	Provide standardized biological contexts with high-quality ground truth labels for fair model comparison
Evaluation Metrics	AvgBIO score, ASW, Batch mixing scores, L2 distance	Quantitatively measure model performance across different task dimensions using established statistical measures
Baseline Methods	HVG selection, scVI, Harmony, Additive model	Serve as performance baselines to contextualize foundation model results and prevent exaggerated claims
Computational Frameworks	BioLLM, scib-metrics, Census API	Provide standardized interfaces for model access, evaluation, and comparison across heterogeneous architectures
Visualization Tools	UMAP, t-SNE, Graphviz	Enable qualitative assessment of embeddings and experimental workflows through dimensionality reduction and diagramming

These standardized resources enable reproducible benchmarking and prevent evaluation artifacts that might favor specific model architectures [4] [21] [24].

The comprehensive benchmarking data reveals that neither scGPT nor Geneformer universally dominates across all applications. Instead, model selection should be guided by specific research needs:

For cell-level tasks and batch integration: scGPT generally provides more robust performance, particularly in zero-shot settings where immediate application without fine-tuning is required [4] [17].
For gene-level functional analysis: Geneformer demonstrates strengths, benefiting from its pretraining approach that captures gene relationships effectively [1] [17].
For perturbation prediction: Surprisingly, simple linear baselines currently outperform both foundation models, suggesting caution when applying these models to therapeutic development applications [24].
For exploratory analysis with unlabeled data: scGPT's zero-shot embeddings provide a reasonable starting point, but practitioners should maintain simpler baselines like HVG selection as competitive alternatives [4] [38].

The absence of a single universal winner underscores the importance of task-specific model selection. Researchers should consider dataset characteristics, computational resources, and specific biological questions when choosing between scGPT, Geneformer, or simpler alternative methods. As the field evolves, continued rigorous benchmarking remains essential to translate model capabilities into genuine biological insights and therapeutic advances.

Conclusion

The benchmarking evidence clearly indicates that while scGPT and Geneformer represent significant advancements, neither consistently outperforms well-established, simpler methods like PCA, scVI, or HVG selection in zero-shot settings. scGPT often demonstrates more robust overall performance across diverse tasks, whereas Geneformer shows specific strengths in gene-level analyses. The choice between them should be guided by the specific biological task, dataset characteristics, and available computational resources. For the field to progress, future development must prioritize rigorous zero-shot evaluation, improved pretraining objectives that capture deeper biological relationships, and the creation of standardized frameworks like BioLLM for fair comparison. The ultimate goal remains the development of models that genuinely learn and generalize biological principles to accelerate drug discovery and clinical translation.