Benchmarking scGPT vs. scFoundation: A Comprehensive Performance Analysis for Single-Cell Biology

Hazel Turner, Nov 27, 2025

This article provides a systematic evaluation of two leading single-cell foundation models, scGPT and scFoundation, based on the latest benchmarking studies.

Abstract

This article provides a systematic evaluation of two leading single-cell foundation models, scGPT and scFoundation, based on the latest benchmarking studies. It explores their foundational concepts and architectures, examines their methodological applications in tasks like drug response prediction and perturbation modeling, identifies key performance limitations and optimization strategies, and delivers a rigorous comparative analysis across multiple biological contexts. Aimed at researchers, scientists, and drug development professionals, this review synthesizes critical insights to guide model selection and application, highlighting current challenges and future directions for integrating AI into biomedical research.

Understanding scGPT and scFoundation: Core Architectures and Pretraining Paradigms

Defining Single-Cell Foundation Models (scFMs) and Their Role in Biology

Table of Contents
  • Introduction to scFMs
  • Head-to-Head Performance Comparison
  • Experimental Protocols for Benchmarking
  • Visualizing the scFM Workflow
  • The Scientist's Toolkit: Essential Research Reagents & Materials

Single-cell Foundation Models (scFMs) are large-scale deep learning models, typically based on transformer architectures, that are pretrained on vast datasets of single-cell RNA sequencing (scRNA-seq) data [1]. The core concept draws an analogy from natural language processing: treating a cell as a "sentence" and its constituent genes as "words" [1] [2]. By training on millions of cells across diverse tissues, conditions, and species, these models aim to learn fundamental principles of cellular biology and gene-gene interactions in a self-supervised manner [3] [1]. This pretraining allows scFMs to develop rich, internal representations of biological knowledge, which can then be adapted—or fine-tuned—for a wide array of downstream tasks without the need to train a new model from scratch for each specific application [1].
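The "cell as a sentence, genes as words" analogy can be made concrete with a minimal tokenization sketch. The gene names, toy vocabulary, and dropped-gene behavior below are illustrative assumptions, not the actual preprocessing of either model:

```python
# Minimal sketch of the "cell as a sentence" analogy: a cell's expression
# profile becomes parallel lists of gene-identity tokens ("words") and
# expression values. Gene names and values here are illustrative only.

def tokenize_cell(expression: dict, vocab: dict):
    """Map a cell's {gene: value} profile to parallel token/value lists."""
    genes = [g for g in expression if g in vocab]  # drop out-of-vocabulary genes
    tokens = [vocab[g] for g in genes]             # gene identity "words"
    values = [expression[g] for g in genes]        # expression magnitudes
    return tokens, values

vocab = {"CD3D": 0, "MS4A1": 1, "NKG7": 2}       # toy gene vocabulary
cell = {"CD3D": 5.0, "NKG7": 2.0, "XYZ": 1.0}    # "XYZ" is not in the vocabulary

tokens, values = tokenize_cell(cell, vocab)
print(tokens, values)  # [0, 2] [5.0, 2.0]
```

Real models build on this by combining each gene token with an encoding of its expression value before feeding the pair to a transformer.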

The emergence of scFMs addresses critical challenges in single-cell data analysis, including the characteristically high sparsity, high dimensionality, and technical noise of transcriptome data [3] [4]. They offer a promising unified framework for integrating and comprehensively analyzing the rapidly expanding repositories of single-cell data [1]. Two prominent examples of such models are scGPT and scFoundation, which have been the subject of extensive benchmarking studies to evaluate their respective strengths and limitations [3] [5] [6].

Head-to-Head Performance Comparison

Comprehensive benchmarking reveals that no single scFM consistently outperforms all others across every possible task [3] [6]. Model performance is highly dependent on the specific downstream application, dataset size, and the biological question being asked. The following tables summarize the comparative performance of scGPT and scFoundation across key biological tasks, based on recent, rigorous evaluations.

Table 1: Performance Comparison on Cell-Level Tasks

| Task | Description | scGPT Performance | scFoundation Performance | Key Findings |
|---|---|---|---|---|
| Cell Type Annotation | Classifying cell identity from gene expression. | Superior in zero-shot settings; achieves better cell type separation in embeddings [6]. | Competitive, but generally outperformed by scGPT in independent benchmarks [6]. | scGPT's architecture is particularly proficient at preserving biologically relevant information, enhancing cell type clustering [6]. |
| Batch Integration | Correcting for technical variations between datasets. | Superior at removing batch effects while preserving biological variation in zero-shot tasks [3] [6]. | Effective at distinguishing certain cell types, but generally less effective at batch correction than scGPT [6]. | A unified framework found scGPT outperformed other models, including scFoundation, on batch-effect-removal metrics [6]. |
| Cancer Cell Identification | Identifying malignant cells within a tumor microenvironment. | Robust and versatile performance across diverse applications and cancer types [3]. | Robust and versatile performance across diverse applications and cancer types [3]. | Both models demonstrated utility in this clinically relevant task, with no single model being a clear winner in all contexts [3]. |

Table 2: Performance Comparison on Gene-Level and Perturbation Tasks

| Task | Description | scGPT Performance | scFoundation Performance | Key Findings |
|---|---|---|---|---|
| Perturbation Prediction | Predicting gene expression changes after a genetic or chemical intervention. | Underperformed compared to simpler baseline models (e.g., Random Forest with GO features) [5]. | Underperformed compared to simpler baseline models, including the "Train Mean" baseline [5]. | A key study found that even the simplest baseline model (predicting the mean of training data) could outperform these foundation models on certain Perturb-seq benchmarks [5]. |
| Gene Function Prediction | Inferring gene function and relationships from embeddings. | Strong capabilities, benefiting from effective pretraining strategies [6]. | Strong capabilities in gene-level tasks [6]. | Both models automatically learn a gene embedding matrix that can be leveraged for predicting biological relationships [3] [6]. |

Experimental Protocols for Benchmarking

The performance data presented in the previous section are derived from standardized benchmarking frameworks designed to ensure a fair and rigorous comparison. The core methodology involves a "zero-shot" or "fine-tuning" evaluation of the model's learned representations on specific, held-out downstream tasks [3] [6].

Benchmarking Workflow

A typical benchmarking pipeline involves several critical stages:

  • Feature Extraction: Zero-shot cell or gene embeddings are extracted from the pre-trained scFMs without any further task-specific training. This tests the intrinsic biological knowledge captured during pre-training [3] [4].
  • Downstream Task Execution: These embeddings are then used as input to various downstream tasks. For cell-level tasks like annotation, a simple classifier (e.g., logistic regression) is often trained on the embeddings. For gene-level tasks, the similarity between gene embeddings is evaluated against known biological databases [3].
  • Performance Quantification: Model performance is evaluated using a battery of metrics. These can include standard metrics like clustering accuracy, as well as novel, biology-informed metrics like scGraph-OntoRWR (which measures consistency of cell-type relationships with prior biological knowledge) and Lowest Common Ancestor Distance (LCAD) (which assesses the severity of cell type misannotation errors) [3] [4].
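The feature-extraction and downstream-task stages above can be sketched with a linear probe (logistic regression) trained on frozen cell embeddings. The embeddings below are synthetic stand-ins for what an scFM would produce, and the three "cell types" are mock labels:

```python
# Hedged sketch of zero-shot evaluation: frozen embeddings + a simple
# classifier. The embeddings are synthetic stand-ins for scFM output.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_cells, dim = 300, 32
labels = rng.integers(0, 3, size=n_cells)            # 3 mock cell types
# Give each "cell type" a distinct centroid so the probe has signal.
centroids = rng.normal(size=(3, dim))
emb = centroids[labels] + 0.1 * rng.normal(size=(n_cells, dim))

X_tr, X_te, y_tr, y_te = train_test_split(emb, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```

Because the probe itself is deliberately simple, its accuracy mostly reflects how much cell-type information the frozen embeddings already contain.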

Key Evaluation Metrics

Benchmarking studies employ a diverse set of metrics to holistically assess model performance [3]:

  • Cell Embedding Quality: Measured by the Average Silhouette Width (ASW), which indicates how well-separated different cell types are in the latent space [6].
  • Biological Fidelity: Evaluated through gene regulatory network (GRN) analysis and the novel ontology-based metrics (scGraph-OntoRWR, LCAD) that compare model outputs to established biological knowledge [3] [6].
  • Prediction Accuracy: For tasks like cell annotation and perturbation prediction, standard classification and regression metrics are used, such as Pearson correlation between predicted and actual pseudo-bulk expression profiles in differential expression space [5].
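The ASW idea above can be sketched with scikit-learn's `silhouette_score` on mock embeddings; the cluster geometry is invented for illustration:

```python
# Sketch of the ASW metric: silhouette_score measures how well cells of
# the same type cluster together in the latent space. Synthetic data only.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)                    # two mock cell types
# Well-separated clusters -> silhouette near 1; overlapping -> near 0.
tight = np.vstack([rng.normal(0, 0.1, (50, 8)), rng.normal(5, 0.1, (50, 8))])
mixed = rng.normal(0, 1.0, (100, 8))

print(f"separated: {silhouette_score(tight, labels):.2f}")
print(f"mixed:     {silhouette_score(mixed, labels):.2f}")
```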

Visualizing the scFM Workflow

The diagram below illustrates the standard lifecycle of a single-cell Foundation Model, from pretraining on large-scale data to application on downstream biological tasks.

[Diagram] Pre-training phase: massive scRNA-seq data (10M-100M+ cells) → tokenization & input encoding → transformer model (scGPT, scFoundation, etc.) trained with a self-supervised objective (e.g., masked gene modeling) → pre-trained scFM with biological knowledge encoded in its weights. Downstream phase: the pre-trained scFM is adapted to a new target dataset via task-specific fine-tuning (or zero-shot inference) and applied to diverse biological tasks: cell-level (cell type annotation, batch integration, cancer cell identification), gene-level (gene function prediction, network inference), and perturbation tasks (drug response prediction, genetic perturbation modeling).

Lifecycle of a Single-Cell Foundation Model

The Scientist's Toolkit: Essential Research Reagents & Materials

Successfully applying and benchmarking scFMs requires a combination of computational tools, software frameworks, and curated biological data resources. The following table details key components of the modern computational biologist's toolkit for working with models like scGPT and scFoundation.

Table 3: Essential Resources for scFM Research

| Category | Item / Tool | Function & Description |
|---|---|---|
| Software & Frameworks | BioLLM | A unified framework that standardizes the deployment of various scFMs (like scGPT and scFoundation) through consistent APIs, enabling seamless model switching and comparative benchmarking [6]. |
| Data Resources | CZ CELLxGENE | A curated atlas and database that provides unified access to millions of annotated single-cell datasets, often used for model pretraining and as a source of high-quality, independent validation data [3] [1]. |
| Data Resources | Perturb-seq Datasets | High-throughput single-cell datasets combining CRISPR-based genetic perturbations with sequencing. They serve as the primary benchmark for evaluating a model's ability to predict cellular responses to genetic interventions [5]. |
| Baseline Models | Traditional ML Models (e.g., RF, kNN) | Simple machine learning models like Random Forest (RF) and k-Nearest Neighbors (kNN) are used as critical baselines. They help determine if the complexity of a foundation model provides a tangible performance benefit for a given task [3] [5]. |
| Evaluation Metrics | Cell Ontology-Informed Metrics (e.g., LCAD) | Novel metrics that incorporate prior biological knowledge from cell ontologies to assess whether model errors are biologically reasonable (e.g., misclassifying a T-cell as a B-cell is less severe than misclassifying it as a neuron) [3] [4]. |
| Gene Embedding Baselines | Functional Representation of Gene Signatures (FRoGS) | An alternative method for generating gene embeddings via random walks on a biological hypergraph. Used as a baseline to evaluate the quality of gene representations learned by scFMs [3]. |

Single-cell RNA sequencing (scRNA-seq) has revolutionized biology by allowing researchers to probe cellular heterogeneity at an unprecedented resolution. The emergence of single-cell foundation models (scFMs), inspired by breakthroughs in large language models (LLMs), represents a paradigm shift in how this complex data is analyzed. These models, pre-trained on massive collections of single-cell data, aim to learn universal patterns of cellular biology that can be adapted to diverse downstream tasks. Among these, scGPT and scFoundation have emerged as prominent transformer-based models. This guide provides an objective comparison of their performance, underpinned by experimental data from recent benchmarking studies, to inform researchers and drug development professionals about their respective strengths and limitations.

Architectural and Methodological Deep Dive

Core Architectures of scGPT and scFoundation

The design philosophies of scGPT and scFoundation, while both rooted in transformer architecture, differ in ways that influence their capabilities and performance.

scGPT leverages a generative pre-trained transformer architecture, specifically designed to handle the non-sequential nature of gene expression data [1] [7]. Its input processing creates a composite embedding for each gene by combining its identity (a unique gene token) and its expression value (often binned into discrete values) [8]. A key innovation is its use of a specialized attention mask within its transformer blocks, which allows for generative pre-training on gene expression profiles without relying on a fixed gene order [2] [7]. scGPT was pre-trained on a massive corpus of over 33 million human cells from 51 organs and 441 studies, collated from the CELLxGENE collection [9] [10] [2].
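The value-binning step described above can be roughly sketched, assuming quantile-based bins with a dedicated zero bin for the sparsity typical of scRNA-seq; the real scGPT binning scheme may differ in details:

```python
# Sketch of expression-value binning: continuous expression values are
# discretized so each gene gets both an identity token and a value token.
# Quantile bin edges are an illustrative assumption.
import numpy as np

def bin_expression(values: np.ndarray, n_bins: int = 5) -> np.ndarray:
    """Assign each nonzero expression value a bin index in [1, n_bins];
    zeros keep a dedicated bin 0 (matching scRNA-seq sparsity)."""
    binned = np.zeros_like(values, dtype=int)
    nonzero = values > 0
    if nonzero.any():
        edges = np.quantile(values[nonzero], np.linspace(0, 1, n_bins + 1))
        binned[nonzero] = np.clip(
            np.searchsorted(edges, values[nonzero], side="right"), 1, n_bins
        )
    return binned

expr = np.array([0.0, 0.5, 1.2, 3.4, 8.0])
print(bin_expression(expr))  # zeros stay in bin 0; larger values land in higher bins
```

Binning per cell (rather than globally) helps normalize for differences in sequencing depth between cells.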

scFoundation, in contrast, employs an asymmetric encoder-decoder architecture [4]. It is designed to process a much larger input gene set, encompassing nearly all ~19,000 human protein-encoding genes along with common mitochondrial genes [5] [4]. Its pre-training strategy incorporates a read-depth-aware masked gene modeling (MGM) objective, using a mean squared error (MSE) loss to reconstruct masked gene expressions [4].
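The masked gene modeling (MGM) objective with MSE loss can be sketched as follows; a trivial per-gene-mean predictor stands in for the transformer, and the 15% mask rate is an illustrative assumption:

```python
# Minimal sketch of a masked-gene-modeling (MGM) objective with MSE loss:
# a random subset of gene values is hidden and must be reconstructed.
# A "predict the per-gene mean" model stands in for the real transformer.
import numpy as np

rng = np.random.default_rng(0)
expr = rng.poisson(3.0, size=(200, 50)).astype(float)  # cells x genes

mask = rng.random(expr.shape) < 0.15          # mask ~15% of gene slots
visible = np.where(mask, np.nan, expr)        # what the model sees

# Stand-in "model": reconstruct masked entries with each gene's visible mean.
gene_means = np.nanmean(visible, axis=0)
pred = np.where(mask, gene_means[None, :], expr)

mse = np.mean((pred[mask] - expr[mask]) ** 2)  # loss only on masked slots
print(f"masked-reconstruction MSE: {mse:.3f}")
```

A trained model should beat this mean-imputation floor by exploiting gene-gene dependencies within each cell.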

Table: Architectural Comparison of scGPT and scFoundation

| Feature | scGPT | scFoundation |
|---|---|---|
| Core Architecture | GPT-like (decoder-based) | Asymmetric encoder-decoder |
| Model Parameters | ~50 million [4] | ~100 million [4] |
| Pre-training Dataset Size | ~33 million cells [4] [10] | ~50 million cells [4] |
| Input Gene Handling | ~1,200 highly variable genes (HVGs) [4] | ~19,264 protein-encoding genes [4] |
| Value Embedding | Expression value binning [8] | Value projection [4] |
| Positional Embedding | Not used [4] | Not used [4] |

Experimental Workflow for Benchmarking Foundation Models

The evaluation of scFMs like scGPT and scFoundation follows a structured pipeline to ensure fair and informative comparisons. The following diagram visualizes a typical benchmarking workflow as implemented in frameworks like BioLLM [4] [6].

[Diagram] Raw single-cell data → standardized preprocessing & QC → model initialization (scGPT, scFoundation, etc.) → feature extraction (zero-shot cell/gene embeddings) → task-specific fine-tuning → performance evaluation.

Diagram: Benchmarking Workflow for Single-Cell Foundation Models

Performance Benchmarking Across Key Tasks

Cell-Level Tasks: Embedding and Annotation

A primary application of scFMs is to generate meaningful representations (embeddings) of cells that capture biological state, which is crucial for tasks like cell type annotation and batch integration.

In a comprehensive benchmark by BioLLM, which evaluated zero-shot cell embeddings using metrics like Average Silhouette Width (ASW) to measure cluster purity, scGPT consistently outperformed other models, including scFoundation [6]. scGPT's embeddings provided superior separation of cell types in visualizations and demonstrated greater effectiveness in integrating data across batches, though it, like other models, struggled to correct for strong batch effects across different sequencing technologies [6]. Another independent study confirmed that fine-tuned scGPT outperformed Geneformer in cell type annotation, though it noted that inconsistent results across studies highlight the importance of proper adaptation techniques like Parameter-Efficient Fine-Tuning (PEFT) [8].

Perturbation Response Prediction

Predicting cellular transcriptional responses to genetic perturbations is a rigorous test of a model's grasp of gene regulatory mechanics. A dedicated benchmarking study yielded surprising results [5].

The study evaluated models on their ability to predict post-perturbation gene expression profiles (in differential expression space) across four Perturb-seq datasets. The results demonstrated that even a simple baseline model (Train Mean), which predicts the average expression profile from the training data, could outperform the fine-tuned foundation models. More notably, a Random Forest (RF) regressor using prior biological knowledge like Gene Ontology (GO) vectors outperformed scGPT by a large margin [5].

Table: Performance in Perturbation Prediction (Pearson Correlation in Differential Expression Space)

| Model | Adamson | Norman | Replogle (K562) | Replogle (RPE1) |
|---|---|---|---|---|
| Train Mean (Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT (Fine-tuned) | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation (Fine-tuned) | 0.552 | 0.459 | 0.269 | 0.471 |
| RF with GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| RF with scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |

An important finding was that using the pre-trained embeddings from scGPT as features for a Random Forest model led to better performance than using the fine-tuned scGPT model itself, though it still fell short of the RF model with GO features [5]. This suggests that while scGPT's embeddings contain valuable biological information, the full fine-tuning pipeline may not be leveraging it optimally for this specific task.
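The Train Mean baseline and the Pearson-Delta metric behind these comparisons can be sketched on synthetic data. The key (assumed) property driving the baseline's strength is that perturbation responses share gene-wise structure across perturbations:

```python
# Sketch of the "Train Mean" baseline and the Pearson-Delta metric:
# correlation between predicted and observed expression *changes*
# relative to control. All data here are synthetic.
import numpy as np

rng = np.random.default_rng(2)
n_genes = 100
control = rng.normal(5, 1, n_genes)            # unperturbed profile
shared = rng.normal(0, 1, n_genes)             # gene-wise response shared across perturbations
train = control + shared + rng.normal(0, 0.3, (8, n_genes))   # 8 seen perturbations
test_truth = control + shared + rng.normal(0, 0.3, n_genes)   # unseen perturbation

train_mean_pred = train.mean(axis=0)           # the baseline prediction

def pearson_delta(pred, truth, ctrl):
    """Correlate deltas (profile minus control), not raw profiles."""
    return np.corrcoef(pred - ctrl, truth - ctrl)[0, 1]

r = pearson_delta(train_mean_pred, test_truth, control)
print(f"Train-Mean Pearson Delta: {r:.2f}")
```

When responses to different perturbations are this correlated, averaging the training data already captures most of the predictable signal, which is why the baseline is hard to beat.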

Drug Response Prediction

The application of scFMs to predict patient-specific or cell-specific responses to drugs is highly relevant for therapeutic development. The scDrugMap benchmark provides insights here [11].

In pooled-data evaluation (training and testing on mixed datasets), scFoundation achieved the best performance, with a mean F1 score of 0.971, outperforming other models by a significant margin [11]. However, in the more challenging cross-data evaluation (testing on datasets not seen during training), which better assesses model generalizability, scGPT excelled in zero-shot learning (mean F1: 0.858), while another model, UCE, performed best after fine-tuning [11]. This indicates a trade-off: scFoundation may achieve higher peak performance on familiar data distributions, while scGPT shows stronger inherent generalization in some contexts.

The Scientist's Toolkit: Essential Research Reagents

Benchmarking studies rely on a suite of computational tools and data resources. The table below details key components used in the evaluations discussed.

Table: Key Reagents for Single-Cell Foundation Model Research

| Reagent / Resource | Type | Function in Research |
|---|---|---|
| Perturb-seq Datasets [5] | Experimental Data | Provides ground-truth data (genetic perturbation + scRNA-seq) for evaluating model predictions of causal cellular responses. |
| CELLxGENE Atlas [9] [2] | Data Repository | A primary source of millions of curated single-cell datasets used for pre-training and as a reference for model applications. |
| BioLLM Framework [6] | Software Tool | A unified framework that standardizes the integration, fine-tuning, and evaluation of different scFMs, ensuring fair comparisons. |
| Gene Ontology (GO) Vectors [5] | Prior Knowledge | Structured, biologically grounded feature sets used to build powerful baseline models for tasks like perturbation prediction. |
| Parameter-Efficient Fine-Tuning (PEFT) [8] | Computational Method | Adaptation techniques like LoRA that efficiently tailor large scFMs to new tasks, reducing computational cost and catastrophic forgetting. |

The benchmarking data reveals that the competition between scGPT and scFoundation is not a simple matter of one being universally superior. Instead, each model demonstrates distinct strengths, a finding consistent with a broader benchmark concluding that "no single scFM consistently outperforms others across all tasks" [4].

  • scGPT shows robust and often superior performance in cell-level tasks like annotation and batch integration, generates high-quality zero-shot embeddings, and generalizes well in challenging cross-data drug response prediction [6] [11]. Its architecture appears well-suited for learning generalizable representations of cellular state.
  • scFoundation can achieve top-tier performance on specific prediction tasks when data conditions are favorable, as seen in the pooled-data drug response benchmark [11]. Its capacity to process a full gene set may provide an advantage in certain contexts.

A critical insight from the perturbation prediction benchmarks is that foundation models do not always outperform simpler, biologically-informed baseline models [5]. This highlights the necessity of including such baselines in evaluations to properly assess the value added by these large-scale models.

For researchers, the choice between scGPT, scFoundation, or a simpler alternative should be guided by the specific downstream task, dataset size, available computational resources, and the need for biological interpretability. As the field matures, standardized frameworks like BioLLM and continued rigorous benchmarking will be essential for guiding the effective application of these powerful tools in biological discovery and drug development.

A Benchmarking Perspective on Single-Cell Foundation Models

The development of single-cell foundation models (scFMs) represents a significant push in computational biology, aiming to leverage large-scale datasets to build models that can generalize across diverse biological tasks. Among these, scFoundation is a prominent model that utilizes an asymmetric transformer architecture and was pre-trained on approximately 50 million human cells [12] [13] [4]. This guide objectively situates scFoundation's performance within the competitive landscape of single-cell foundation models, focusing on direct comparisons with alternatives like scGPT, Geneformer, and UCE, based on recent, rigorous benchmarking studies.


Model Architecture & Pretraining

The performance of any foundation model is fundamentally shaped by its architectural choices and the scale of its training data.

  • scFoundation: Employs an asymmetric encoder-decoder transformer architecture and is categorized as a value projection-based model [4]. This approach aims to preserve the full resolution of gene expression data by directly predicting raw expression values. Its pretraining was conducted on a corpus of around 50 million human cells, resulting in a model with ~100 million parameters [12] [4]. The pretraining task was a read-depth-aware masked gene modeling (MGM) objective, optimized using a Mean Squared Error (MSE) loss [4].

  • scGPT: Also a value projection model, scGPT uses a standard transformer encoder architecture and incorporates an attention mask mechanism [13] [4]. It segments gene expression values into bins, treating the prediction as a regression task. It was pretrained on over 33 million human cells (non-cancerous) and has ~50 million parameters [12] [4]. Its pretraining combines both generative objectives and iterative MGM.

  • Geneformer: This model adopts a different strategy, based on the ordering of genes by expression level [13] [4]. It is a rank-based model that learns by predicting gene positions within a cell's context. Geneformer was trained on 30 million cells from humans and mice and has a smaller architecture with 40 million parameters [13] [4]. Its pretraining uses MGM with a Cross-Entropy (CE) loss for gene identity prediction.

  • UCE (Universal Cell Embedding): Distinguished by its use of protein language model embeddings from ESM-2 as gene representations, UCE is a massive model with 650 million parameters [14] [4]. It was pretrained on 36 million cells and uses a modified MGM task with a binary cross-entropy loss to predict whether a gene is expressed or not [4].

The following diagram summarizes the core pretraining workflow common to these models, highlighting key steps like tokenization and the masked gene modeling objective.

[Diagram] Raw single-cell data (millions of cells) → tokenization & input representation → transformer model (encoder or encoder-decoder) → masked gene modeling (MGM) objective → learned gene and cell embeddings.

Performance Benchmarking Across Key Tasks

Independent benchmarks have revealed that no single model consistently dominates across all tasks. The table below summarizes the comparative performance of scFoundation against its peers in several critical applications.

| Task | Top Performing Model(s) | scFoundation's Performance & Notes |
|---|---|---|
| Perturbation Response Prediction | Random Forest with GO features; Additive baseline model [5] [14] | Underperformed against a simple baseline that predicts the mean of training data [5] [14]. |
| Drug Response Prediction | scFoundation (pooled-data); UCE (cross-data) [11] | Achieved top performance (mean F1: 0.971) when data is pooled; less dominant in cross-data settings [11]. |
| Zero-Shot Cell Type Clustering | HVG selection, scVI, Harmony [15] [16] | Not among top performers; simpler methods like highly variable gene (HVG) selection outperformed foundation models [15] [16]. |
| Zero-Shot Batch Integration | HVG selection, scVI, Harmony [15] [16] | Not among top performers. Geneformer consistently ranked last, while scGPT showed mixed results [15] [16]. |
| Gene Function Prediction | CellFM, scGPT [12] [17] | CellFM, a newer model, reported improvements; scGPT also showed strong capabilities [12] [17]. |

Deep Dive: Perturbation Prediction Benchmark

Predicting a cell's transcriptomic response to a genetic perturbation is a key test for a model's understanding of regulatory biology. Recent benchmarks have yielded critical insights.

Experimental Protocol [5] [14]:

  • Datasets: Models are evaluated on Perturb-seq datasets (e.g., Adamson, Norman, Replogle), which measure gene expression in single cells after CRISPR-based gene knockdown or overexpression.
  • Task: The model is fine-tuned on a set of seen perturbations and must predict the gene expression profile for unseen perturbations (Perturbation Exclusive, or PEX, setup).
  • Baselines: Performance is compared against deliberately simple models, including:
    • Train Mean: Predicts the average expression profile from the training data.
    • Additive Model: For double perturbations, predicts the sum of the individual logarithmic fold changes.
    • Linear Models: Utilize pre-defined biological features like Gene Ontology (GO) vectors.
  • Evaluation: Predictions are compared to ground truth using metrics like Pearson correlation in the differential expression space (Pearson Delta) and L2 distance for the most highly expressed or differentially expressed genes.
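The Additive baseline in the protocol above can be sketched directly; the log-fold-change vectors and the "observed" double-perturbation response below are invented for illustration:

```python
# Sketch of the "Additive" baseline: for a double perturbation A+B,
# predict the sum of the individual log fold changes measured for A and
# B alone. Values are illustrative, not real Perturb-seq data.
import numpy as np

lfc_a = np.array([0.8, -0.2, 0.0, 1.1])   # log FC of perturbation A vs. control
lfc_b = np.array([0.1, -0.5, 0.3, 0.0])   # log FC of perturbation B vs. control

additive_pred = lfc_a + lfc_b             # predicted log FC of A+B

observed = np.array([0.7, -0.6, 0.4, 0.9])  # hypothetical measured A+B response
r = np.corrcoef(additive_pred, observed)[0, 1]
print(f"Pearson vs. observed: {r:.2f}")
```

The baseline fails exactly where genetic interactions are non-additive, which is also where a foundation model would need to add value.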

Key Findings [5] [14]:

  • A Random Forest regressor using Gene Ontology (GO) features significantly outperformed both scFoundation and scGPT.
  • The simplest baseline, the Train Mean model, surprisingly surpassed the performance of the fine-tuned foundation models.
  • When the gene embeddings learned by scFoundation during pre-training were used as features in a Random Forest model, its performance improved, suggesting that the pre-training does capture some useful biological information, but the model's full architecture may not be leveraging it optimally for this task.

The logical flow of this benchmarking process is outlined below.

[Diagram] Perturb-seq dataset (e.g., Norman, Replogle) → data split into seen vs. unseen perturbations → in parallel: fine-tune foundation models (scFoundation, scGPT, GEARS) and train simple baselines (mean, additive, random forest) → evaluate predictions (Pearson Delta, L2 distance) → result: baselines often outperform foundation models.

The Scientist's Toolkit

The following table details key resources and computational tools referenced in the benchmarking of single-cell foundation models.

| Research Reagent / Resource | Function in Evaluation |
|---|---|
| Perturb-seq Datasets (Adamson, Norman, Replogle) | Provides ground-truth experimental data for benchmarking genetic perturbation prediction models [5] [14]. |
| Gene Ontology (GO) Annotations | A source of biologically meaningful features used in simple baseline models (e.g., Random Forest) to compete against foundation models [5]. |
| BioLLM Framework | A unified software framework that provides standardized APIs for integrating and evaluating different scFMs, ensuring fair comparisons [17]. |
| Highly Variable Genes (HVG) | A simple, traditional feature selection method in single-cell analysis that serves as a strong baseline in zero-shot tasks like clustering and batch correction [15] [16]. |
| Harmony & scVI | Established, specialized algorithms for single-cell data integration (batch correction) and analysis. Used as baseline benchmarks for cell-level tasks [15] [4] [16]. |

In conclusion, benchmarking studies reveal a nuanced picture of the current capabilities of scFoundation and its peers. While these models represent a significant architectural achievement, they do not consistently outperform simpler, often biologically-informed, baseline methods on critical tasks like perturbation prediction and zero-shot analysis [5] [14] [15]. This indicates that the goal of a generalized, out-of-the-box foundation model that fully captures the complexity of cellular biology remains an active challenge.

The choice of model is highly task-dependent. For instance, while scFoundation excelled in one drug response prediction benchmark [11], it was less competitive in perturbation prediction [5] [14]. The field is maturing with the development of standardized evaluation frameworks like BioLLM [17], which will be crucial for guiding future development. The path forward likely involves not only scaling model and dataset size but also more effectively integrating prior biological knowledge to build models that offer robust, generalizable, and biologically plausible predictions.

In single-cell RNA sequencing (scRNA-seq) data, genes do not possess a natural sequential order, unlike words in a sentence. This fundamental difference presents a significant challenge for applying transformer-based architectures, which were originally designed for sequentially ordered text. Treating a cell's gene expression profile as a "sentence" requires researchers to impose an artificial sequence, a process known as tokenization. How different foundation models approach this tokenization problem directly impacts their ability to capture biological relationships and predict cellular behavior.

This guide objectively compares the performance of two prominent single-cell foundation models—scGPT and scFoundation—within the broader context of benchmarking research. By examining their tokenization strategies, architectural implementations, and experimental outcomes, we provide researchers and drug development professionals with critical insights for model selection in biological discovery and therapeutic applications.

Model Architectures and Tokenization Strategies

Table 1: Fundamental Characteristics of scGPT and scFoundation

| Characteristic | scGPT | scFoundation |
|---|---|---|
| Primary Architecture | GPT-like decoder | Asymmetric encoder-decoder |
| Model Parameters | ~50 million | ~100 million |
| Pretraining Dataset Size | ~33 million cells | ~50 million cells |
| Input Gene Capacity | 1,200 highly variable genes (HVGs) | 19,264 protein-encoding genes |
| Value Representation | Value binning | Value projection |
| Positional Embedding | Not used | Not used |
| Gene Symbol Embedding | Lookup table (512 dimensions) | Lookup table (768 dimensions) |

Tokenization Approaches in Practice

Tokenization strategies differ markedly between models, significantly influencing their biological representations:

  • scGPT employs a highly variable gene selection approach, focusing computational resources on 1,200 genes with the most variable expression across cells. It uses value binning to transform continuous expression values into discrete tokens and does not incorporate positional embeddings, treating the gene set as a bag-of-words rather than an ordered sequence [4].

  • scFoundation utilizes a comprehensive gene representation, incorporating nearly all protein-encoding genes. This provides a more complete biological picture but increases computational complexity. Like scGPT, it foregoes positional embeddings, instead using a value projection system to handle continuous expression data [5] [4].
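The contrast between the two input strategies can be sketched with a simple variance-based HVG selector; real pipelines (e.g., scanpy's dispersion-normalized method) are more sophisticated, and the synthetic data below are assumptions:

```python
# Sketch contrasting the two input strategies: HVG selection (scGPT-style)
# versus the full gene set (scFoundation-style). HVG selection here is a
# plain variance ranking on synthetic counts.
import numpy as np

rng = np.random.default_rng(3)
n_cells, n_genes = 500, 2000
expr = rng.poisson(1.0, (n_cells, n_genes)).astype(float)
expr[:, :100] *= rng.uniform(1, 10, 100)     # make the first 100 genes variable

def select_hvgs(x: np.ndarray, n_top: int) -> np.ndarray:
    """Return column indices of the n_top most variable genes."""
    return np.argsort(x.var(axis=0))[::-1][:n_top]

hvgs = select_hvgs(expr, n_top=200)
print(f"full input: {n_genes} genes; HVG input: {len(hvgs)} genes")
print(f"planted high-variance genes recovered in top 200: {(hvgs < 100).sum()}")
```

The trade-off mirrors the models' design choices: HVG selection concentrates compute on informative genes, while the full gene set preserves signal from genes that are only variable in rare contexts.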

Benchmarking Performance on Perturbation Prediction

Experimental Protocols for Perturbation Prediction

Benchmarking studies employed standardized experimental protocols to evaluate model performance on predicting transcriptomic changes after genetic perturbations:

  • Datasets: Models were evaluated on multiple Perturb-seq datasets, including Adamson (68,603 cells with CRISPRi), Norman (91,205 cells with CRISPRa), and Replogle (K562 and RPE1 cell lines, ~162,000 cells each) [5].

  • Training Setup: Foundation models were fine-tuned according to authors' specifications using a perturbation-exclusive (PEX) setup, where models were evaluated on their ability to predict effects of completely unseen perturbations [5].

  • Baseline Models: Simple baseline models including Train Mean (predicting average expression from training data), Elastic-Net Regression, k-Nearest Neighbors, and Random Forest regressors were implemented for comparison [5].

  • Evaluation Metrics: Performance was assessed using Pearson correlation in differential expression space (Pearson Delta) and accuracy in predicting top 20 differentially expressed genes, with pseudo-bulk profiles created by averaging single-cell predictions [5].
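The Pearson Delta metric described above can be sketched in a few lines: average single-cell profiles into pseudo-bulk vectors, subtract the control pseudo-bulk, and correlate the resulting delta vectors. The toy data and array shapes are illustrative.

```python
import numpy as np

def pearson_delta(pred_cells, true_cells, control_cells):
    """Pearson correlation in differential-expression space.

    Single-cell profiles are averaged into pseudo-bulk vectors, the
    control pseudo-bulk is subtracted from each, and the two delta
    vectors are correlated.
    """
    pred_delta = pred_cells.mean(axis=0) - control_cells.mean(axis=0)
    true_delta = true_cells.mean(axis=0) - control_cells.mean(axis=0)
    return np.corrcoef(pred_delta, true_delta)[0, 1]

rng = np.random.default_rng(0)
control = rng.normal(size=(50, 20))        # 50 control cells, 20 genes
true = rng.normal(size=(30, 20)) + 1.0     # perturbed cells, shifted up
noisy_pred = true + rng.normal(0, 0.01, size=true.shape)

# A near-perfect prediction yields a correlation near 1
print(pearson_delta(noisy_pred, true, control) > 0.9)  # True
```

Working in delta space is what allows the trivial Train Mean baseline to score well: any prediction that captures the average perturbation response correlates with much of the observed change.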

Quantitative Performance Comparison

Table 2: Performance Comparison on Perturbation Prediction (Pearson Delta)

| Model | Adamson | Norman | Replogle K562 | Replogle RPE1 |
| --- | --- | --- | --- | --- |
| Train Mean (Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest + GO | 0.739 | 0.586 | 0.480 | 0.648 |
| Random Forest + scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |

  • Performance Gap: Simple baseline models consistently outperformed both foundation models across all datasets. The Train Mean baseline achieved superior Pearson Delta correlation values (0.711, 0.557, 0.373, 0.628) compared to scGPT (0.641, 0.554, 0.327, 0.596) and scFoundation (0.552, 0.459, 0.269, 0.471) across the four benchmark datasets respectively [5].

  • Biological Prior Knowledge Integration: Random Forest models incorporating Gene Ontology (GO) features substantially outperformed all foundation models (0.739, 0.586, 0.480, 0.648), suggesting that explicit biological knowledge may be more valuable than representations learned through pretraining [5].

  • Embedding Utility: When scGPT's pretrained gene embeddings were used as features in Random Forest models, performance improved over the fine-tuned scGPT model itself (0.727 vs. 0.641 on Adamson dataset), indicating that the embeddings capture biologically relevant information that may be underutilized by scGPT's native architecture [5].
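The embedding-as-features idea can be sketched as follows: fixed per-gene embeddings serve as input features to a Random Forest that predicts the mean differential expression of unseen perturbations. The embeddings and targets here are synthetic stand-ins, not the benchmark's actual data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Hypothetical pretrained gene embeddings: one row per perturbed gene
n_perts, emb_dim, n_genes = 40, 16, 100
gene_embeddings = rng.normal(size=(n_perts, emb_dim))

# Synthetic targets: mean differential expression per perturbation,
# made a linear function of the embedding so the signal is learnable
W = rng.normal(size=(emb_dim, n_genes))
delta_expression = gene_embeddings @ W

# Train on some perturbations, predict completely unseen ones (PEX setup)
train, test = slice(0, 30), slice(30, 40)
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(gene_embeddings[train], delta_expression[train])
pred = rf.predict(gene_embeddings[test])
print(pred.shape)  # (10, 100)
```

This setup isolates the information content of the embeddings from the transformer architecture that produced them, which is exactly the comparison the benchmark draws.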

Zero-Shot Performance and Batch Integration

Experimental Protocols for Zero-Shot Evaluation

  • Evaluation Setting: Models were assessed without any task-specific fine-tuning to measure the generalizable biological knowledge acquired during pretraining [15].

  • Tasks: Cell type clustering and batch integration across multiple datasets including Tabula Sapiens, Pancreas, and PBMC datasets [15].

  • Baselines: Compared against established methods including Highly Variable Genes (HVG) selection, Harmony, and scVI [15].

  • Metrics: Average BIO score (AvgBio) for clustering quality and batch integration metrics (PCR) for technical variation removal [15].

Zero-Shot Performance Results

Table 3: Zero-Shot Performance Comparison Across Tasks

| Model | Cell Type Clustering (AvgBio) | Batch Integration (PCR) | Generalization to Unseen Data |
| --- | --- | --- | --- |
| scGPT | Variable, outperformed by HVG and scVI on most datasets | Moderate, outperforms Harmony and scVI on some complex datasets | Inconsistent, no clear advantage over baselines |
| scFoundation | Not fully evaluated in zero-shot setting | Not fully evaluated in zero-shot setting | Limited evaluation available |
| Geneformer | Consistently outperformed by all baselines | Poor, shows inadequate batch mixing | Fails to generalize effectively |
| HVG (Baseline) | Best performing across most datasets | Best batch integration scores | Consistent performance across datasets |

  • Cell Type Clustering: In zero-shot settings, both scGPT and Geneformer were generally outperformed by simple Highly Variable Genes selection and established methods like Harmony and scVI across multiple datasets. HVG achieved the best clustering performance, indicating that foundation models do not necessarily provide superior cell embeddings without fine-tuning [15].

  • Batch Integration: scGPT showed mixed results, outperforming Harmony and scVI on complex datasets with both technical and biological batch effects, but underperforming on datasets with purely technical variation. Geneformer consistently ranked last in batch integration capabilities, with its embeddings often showing higher batch effect retention than the original data [15].

  • Pretraining Impact: Evaluation of different scGPT variants (random, kidney-specific, blood-specific, human) demonstrated that pretraining provides clear improvements over random initialization, but larger and more diverse pretraining datasets do not consistently confer additional benefits, suggesting diminishing returns to scale [15].
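The HVG baseline that outperforms the foundation models here can be approximated by a plain variance ranking. Real HVG selection (e.g., in scanpy) normalizes dispersion by mean expression, so this is a simplified stand-in for illustration.

```python
import numpy as np

def select_hvgs(X: np.ndarray, n_top: int = 2000) -> np.ndarray:
    """Return column indices of the n_top most variable genes.

    Full HVG methods normalize dispersion by mean expression;
    plain variance is used here for brevity.
    """
    variances = X.var(axis=0)
    return np.argsort(variances)[::-1][:n_top]

rng = np.random.default_rng(2)
X = rng.normal(0, 1, size=(500, 100))  # 500 cells x 100 genes
X[:, :5] *= 10.0                       # inflate variance of five genes
top = select_hvgs(X, n_top=5)
print(sorted(top.tolist()))  # [0, 1, 2, 3, 4]
```

Downstream, the selected gene subset typically feeds PCA and clustering; the benchmark's point is that this inexpensive pipeline already sets a strong bar for learned cell embeddings.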

The Scientist's Toolkit: Essential Research Reagents

Table 4: Key Computational Tools and Resources for scFM Research

| Resource | Type | Primary Function | Relevance to Tokenization |
| --- | --- | --- | --- |
| Perturb-seq Datasets | Experimental Data | Provides ground truth for perturbation effects | Enables evaluation of tokenization strategies on functional outcomes |
| Gene Ontology (GO) | Biological Database | Structured knowledge base of gene functions | Provides biological priors for comparison with learned representations |
| CZ CELLxGENE | Data Platform | Standardized access to >100M single cells | Pretraining resource for developing tokenization approaches |
| HVG Selection | Computational Method | Identifies genes with high variability | Basis for scGPT's token reduction strategy |
| Random Forest Regression | Machine Learning Model | Baseline for prediction tasks | Tests biological relevance of gene embeddings independent of transformer architecture |
| Pearson Delta Metric | Evaluation Metric | Correlates predicted vs. actual differential expression | Quantifies performance of different tokenization schemes |

The benchmarking data reveals several critical considerations for researchers and drug development professionals:

  • Simplicity Versus Complexity: Simple baseline models consistently match or outperform sophisticated foundation models in perturbation prediction tasks. The "Train Mean" baseline surprisingly exceeded both scGPT and scFoundation performance, suggesting that current foundation models may not be capturing meaningful perturbation-specific signals beyond basic averaging approaches [5] [14].

  • Tokenization Impact: The choice of tokenization strategy significantly influences model performance. scGPT's focused approach using 1,200 highly variable genes demonstrates that careful gene selection may be more important than comprehensive gene inclusion, as implemented in scFoundation with 19,264 genes [5] [4].

  • Embedding Utility Versus Architecture: The strong performance of Random Forest models using scGPT's embeddings suggests that the pretrained gene representations capture biologically meaningful information, but this potential may not be fully leveraged within the transformer architecture itself [5].

  • Zero-Shot Limitations: Both scGPT and Geneformer show inconsistent zero-shot performance, indicating that their pretraining objectives may not optimally align with downstream biological tasks without fine-tuning [15].

For researchers selecting models for drug development applications, these findings suggest that foundation models should be evaluated against simple baselines specific to each use case. While scGPT and scFoundation represent significant engineering achievements, their practical advantage over simpler, more interpretable methods remains uncertain for critical applications like perturbation prediction. Future development should focus on better alignment between tokenization strategies, pretraining objectives, and biologically meaningful outcomes.

In the development of single-cell foundation models (scFMs) like scGPT and scFoundation, the choice of pretraining data is a fundamental determinant of model performance. These models are trained on vast collections of single-cell data to learn the "language of cells," with the goal of generating accurate predictions for downstream tasks, such as forecasting cellular responses to genetic perturbations [1]. However, recent rigorous benchmarks reveal a surprising trend: these complex models often fail to outperform simple baseline methods on key predictive tasks [18] [14]. This guide objectively compares the major data sources and examines the experimental evidence benchmarking the performance of models built upon them.

Single-cell foundation models rely on large-scale, curated data repositories for pretraining. The table below summarizes the key characteristics of the primary data sources available.

| Atlas Name | # Cells | Lead Organization | # Species | Primary URL |
| --- | --- | --- | --- | --- |
| CZ CELLxGENE Discover | 112.8 M | Chan Zuckerberg Initiative (CZI) | 7 | https://cellxgene.cziscience.com/ |
| Human Cell Atlas (HCA) | 65.4 M | HCA Consortium | 1 | https://data.humancellatlas.org/ |
| DISCO | 125.6 M | Singapore Immunology Network | 1 | https://www.immunesinglecell.org |
| Single Cell Portal | 57.6 M | Broad Institute | 18 | https://singlecell.broadinstitute.org/ |
| Single Cell Expression Atlas | 13.5 M | EMBL-EBI | 21 | https://www.ebi.ac.uk/gxa/sc/home |
| Allen Brain Cell Atlas | 4.0 M | Allen Institute | 1 | https://portal.brain-map.org/ |

Source: Adapted from PMC [7]. Note: Cell counts are approximate as of the time of writing.

Platforms like CZ CELLxGENE provide unified access to millions of annotated single-cell datasets, serving as a cornerstone for the scFM ecosystem [1] [19]. The Human Cell Atlas (HCA) is another monumental project that aggregates data from thousands of studies, regularly updating its portal with new and updated projects [20]. Public repositories such as the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA) host thousands of individual studies, which researchers then integrate into large training corpora [1].

Benchmarking Performance: scGPT vs. scFoundation

Despite their sophisticated architecture and pretraining on massive datasets, both scGPT and scFoundation have been shown to underperform compared to simpler models in predicting post-perturbation gene expression. The following table summarizes key quantitative results from independent benchmarks.

| Model / Baseline | Performance Summary (Pearson Delta, Differential Expression) | Key Benchmarking Finding |
| --- | --- | --- |
| scGPT | Adamson: 0.641, Norman: 0.554, Replogle K562: 0.327, Replogle RPE1: 0.596 [18] | Underperformed versus simple mean baseline and random forest models. |
| scFoundation | Adamson: 0.552, Norman: 0.459, Replogle K562: 0.269, Replogle RPE1: 0.471 [18] | Underperformed versus simple mean baseline and random forest models. |
| Train Mean (Baseline) | Adamson: 0.711, Norman: 0.557, Replogle K562: 0.373, Replogle RPE1: 0.628 [18] | The simplest baseline, which predicts the average expression from training data, outperformed both foundation models. |
| Random Forest + GO Features | Adamson: 0.739, Norman: 0.586, Replogle K562: 0.480, Replogle RPE1: 0.648 [18] | Outperformed foundation models by a large margin by using biologically meaningful features. |
| Additive Model (Baseline) | Outperformed all deep learning models in predicting double perturbation effects [14]. | A simple baseline that sums the effects of single perturbations was not beaten by any complex model. |
| Linear Model with Pretrained Embeddings | Performed as well as or better than scGPT and GEARS with their built-in decoders [14]. | Using embeddings from scFMs in a simple linear model was more effective than using the models' own complex architectures. |

A study published in Nature Methods (2025) reached a similarly stark conclusion, finding that no deep-learning model could consistently outperform the simple mean prediction or a linear model in predicting the effects of unseen single-gene perturbations [14]. Furthermore, in predicting double-gene perturbations, even the simplistic "additive" baseline model, which sums the effects of two single perturbations, proved superior to all foundation models [14].

Experimental Protocols in Benchmarking Studies

To ensure fair comparisons, independent benchmarks have employed rigorous and consistent methodologies.

  • Datasets: Benchmarks typically use well-established Perturb-seq datasets, including:
    • Adamson et al.: 68,603 single cells with single-guide CRISPRi perturbations in K562 cells [18] [14].
    • Norman et al.: 91,205 single cells with single and dual CRISPRa (overexpression) perturbations in K562 cells [18] [14].
    • Replogle et al.: Over 160,000 single cells each from genome-wide CRISPRi screens in K562 and RPE1 cell lines [18] [14].
  • Evaluation Metrics:
    • Pearson Delta: The Pearson correlation between the predicted and ground truth differential expression profiles (perturbed vs. control). This is considered more meaningful than correlation in raw expression space [18].
    • L2 Distance: The Euclidean distance between predicted and observed expression values for the top highly expressed or differentially expressed genes [14].
  • Benchmarking Setup: Models are evaluated in a Perturbation Exclusive (PEX) setting, where their ability to generalize to unseen perturbations is tested. Models are fine-tuned on a set of perturbations and then evaluated on a held-out set [18] [14].
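The PEX setting can be implemented by splitting on perturbation labels rather than on individual cells, so every test perturbation is entirely absent from training. A minimal sketch with toy labels:

```python
import numpy as np

def pex_split(pert_labels, test_frac=0.2, seed=0):
    """Split cell indices so that test perturbations are fully unseen.

    Unlike a random cell-level split, entire perturbations are held out,
    forcing the model to generalize to new interventions.
    """
    rng = np.random.default_rng(seed)
    perts = np.unique(pert_labels)
    rng.shuffle(perts)
    n_test = max(1, int(len(perts) * test_frac))
    test_perts = set(perts[:n_test].tolist())
    is_test = np.array([p in test_perts for p in pert_labels])
    return np.where(~is_test)[0], np.where(is_test)[0]

labels = np.array(["KLF1", "GATA1", "KLF1", "TP53", "GATA1", "TP53"])
train_idx, test_idx = pex_split(labels, test_frac=0.34)
# No perturbation appears on both sides of the split
print(set(labels[train_idx]) & set(labels[test_idx]))  # set()
```

A naive cell-level split would leak perturbation identity between train and test, inflating scores; the PEX split avoids this and is why the benchmarks above are considered rigorous.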

Workflow: Public Data Repositories (CELLxGENE, HCA, GEO, SRA) → Data Preprocessing (QC, normalization, batch correction) → Tokenization (genes as tokens, expression value bins) → Model Pretraining (self-supervised learning on millions of cells) → Task Fine-Tuning (e.g., on perturbation data) → Benchmarking (PEX: predict unseen perturbations) → Performance Comparison (Pearson Delta, L2 distance). Simple baselines (mean, additive, Random Forest) enter directly at the benchmarking step.

Diagram of the scFM pretraining and benchmarking workflow.

The Scientist's Toolkit: Key Research Reagents & Solutions

The following table details essential resources and their functions in this field, from data portals to computational tools.

Resource Name Type Primary Function in Research
CELLxGENE Discover Data Portal Provides unified access to over 100 million curated single-cells for discovery and analysis [1] [19].
HCA Data Portal Data Portal Centralized platform to explore and download data from the Human Cell Atlas project [20].
Perturb-seq Experimental Technology Combines CRISPR-based genetic perturbations with single-cell RNA sequencing to generate ground-truth data for benchmarking [18].
Gene Ontology (GO) Knowledge Base Provides structured biological knowledge features (e.g., functional annotations) that can be used to build highly predictive baseline models [18].
Random Forest Regressor Computational Model A classic machine learning algorithm that, when provided with GO features, has been shown to outperform complex foundation models [18].
Linear Model with Embeddings Computational Model A simple model that uses pretrained gene embeddings from scFMs as input, often outperforming the original complex models [14].

In conclusion, while data sources like CELLxGENE and the Human Cell Atlas are invaluable for pretraining scFMs, current evidence indicates that the models themselves may not yet be leveraging this data effectively for perturbation prediction. Researchers should consider these benchmarking results and the power of simple, biologically-informed baselines when designing and evaluating their own studies.

Practical Applications: From Drug Response to Perturbation Prediction

The accurate prediction of drug response is a critical challenge in modern oncology, directly impacting the development of effective cancer therapies and the understanding of drug resistance mechanisms. Single-cell RNA sequencing (scRNA-seq) technology has emerged as a powerful tool for characterizing the cellular heterogeneity that underpins varying treatment outcomes [21]. Recently, large-scale foundation models pre-trained on massive biological datasets have shown potential for enhancing single-cell analysis. This guide provides an objective performance comparison of two prominent foundation models—scGPT and scFoundation—within the scDrugMap benchmarking framework, offering researchers evidence-based insights for model selection in drug response prediction tasks.

The scDrugMap framework represents the first comprehensive benchmark for evaluating large foundation models on drug response prediction using single-cell data. It incorporates a curated resource of over 326,000 cells from 36 datasets across 23 studies, spanning diverse cancer types, tissues, and treatment regimens [11] [22] [21]. The framework evaluates models under two distinct scenarios—pooled-data evaluation and cross-data evaluation—implementing both layer freezing and Low-Rank Adaptation (LoRA) fine-tuning strategies [21].

Table 1: Overall Performance Comparison of scGPT and scFoundation in scDrugMap Benchmark

| Model | Pooled-Data Evaluation (F1 Score) | Cross-Data Evaluation (F1 Score) | Key Strengths |
| --- | --- | --- | --- |
| scFoundation | 0.971 (layer freezing), 0.947 (fine-tuning) | Not reported | Excels in pooled-data scenarios with extensive training data |
| scGPT | Not best performer | 0.858 (zero-shot) | Superior cross-dataset generalization with zero-shot learning |
| UCE | Not best performer | 0.774 (fine-tuning on tumor tissue) | Strong performance after fine-tuning on specific tissue types |

Table 2: Detailed Performance Metrics Across Evaluation Settings

| Evaluation Scenario | Training Strategy | scFoundation Performance | scGPT Performance | Top Performing Model |
| --- | --- | --- | --- | --- |
| Pooled-Data | Layer Freezing | 0.971 (F1) | Lower than scFoundation | scFoundation |
| Pooled-Data | LoRA Fine-tuning | 0.947 (F1) | Lower than scFoundation | scFoundation |
| Cross-Data | Zero-Shot Learning | Lower than scGPT | 0.858 (F1) | scGPT |
| Cross-Data | Fine-tuning | Not best performer | Lower than UCE | UCE (0.774 F1) |

Comparative Analysis of Model Performance

Pooled-Data Evaluation

In the pooled-data evaluation scenario, where models are trained and tested on aggregated data from multiple studies, scFoundation demonstrated superior performance compared to all other models, including scGPT [11] [21]. scFoundation achieved the highest mean F1 scores of 0.971 with layer freezing and 0.947 with fine-tuning, outperforming the lowest-performing model by over 50% [21]. This indicates that scFoundation excels in contexts where substantial training data from multiple sources is available, effectively leveraging its pre-training on large-scale single-cell data.

Cross-Data Evaluation

In cross-data evaluation, where models are tested independently on datasets from individual studies to assess generalization capabilities, scGPT demonstrated superior performance in zero-shot learning with a mean F1 score of 0.858 [21]. This highlights scGPT's stronger generalization to unseen data distributions without additional training. After fine-tuning on tumor tissue, UCE achieved the highest performance (mean F1: 0.774) in this setting [21], suggesting that model performance is highly dependent on both the base architecture and the adaptation strategy.

Critical Perspectives on Foundation Model Performance

Independent benchmarking studies beyond scDrugMap have revealed important limitations in current foundation models for biological prediction tasks. Research published in Nature Methods found that neither scGPT nor scFoundation outperformed deliberately simple baselines for predicting genetic perturbation effects [14]. Simple models—including taking the mean of training examples or using basic machine learning models with biologically meaningful features—often outperformed these foundation models by a substantial margin [5] [14].

Similarly, zero-shot evaluations published in Genome Biology demonstrated that both scGPT and Geneformer underperform simpler methods like highly variable gene selection and established integration tools (Harmony, scVI) on tasks including cell type clustering and batch integration [16] [15]. These findings highlight that while foundation models show promise, their practical utility for drug response prediction requires careful validation against simpler alternatives.

scDrugMap Experimental Framework

Datasets and Curation

The scDrugMap framework incorporates a primary collection of 326,751 single tumor cells from 36 scRNA-seq datasets across 23 studies, with representation of 14 cancer types, 3 therapy categories (targeted therapy, chemotherapy, immunotherapy), and multiple tissue types (cell lines, bone marrow aspirates, tumor tissue, PBMCs) [21]. An independent validation collection includes 18,856 cells from 17 datasets across 6 studies [21]. This comprehensive coverage ensures robust benchmarking across diverse biological contexts.

Model Training and Adaptation Strategies

scDrugMap implements two primary adaptation strategies for foundation models:

  • Layer Freezing: The pre-trained model weights are kept fixed while only task-specific heads are trained, preserving the knowledge acquired during pre-training.
  • Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning method that introduces trainable low-rank matrices into the model architecture, enabling effective adaptation with minimal computational overhead [21].
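LoRA's core mechanism, a frozen weight matrix plus a trainable low-rank update scaled by alpha/r, can be sketched in a few lines. The dimensions, initialization, and scaling constant below are illustrative defaults, not scDrugMap's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out, r, alpha = 64, 64, 4, 8  # illustrative sizes

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, rank r
B = np.zeros((d_out, r))               # trainable, zero-initialized so
                                       # the update starts as a no-op

def lora_forward(x):
    # Frozen path plus scaled low-rank correction
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0 the adapted layer matches the frozen layer exactly
print(np.allclose(lora_forward(x), W @ x))  # True
```

Only A and B are updated during fine-tuning (here 512 parameters versus 4,096 in W), which is what makes LoRA attractive for adapting large pretrained models with modest compute.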

Evaluation Metrics and Protocol

The benchmarking protocol employs the F1 score as the primary metric, providing a balanced measure of precision and recall for drug response prediction [21]. The evaluation follows rigorous data splitting strategies appropriate for each scenario, with cross-validation in pooled-data settings and leave-one-dataset-out validation in cross-data settings to ensure reliable performance estimation.
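For reference, the F1 score used as the primary metric is the harmonic mean of precision and recall. A minimal binary-classification sketch with toy responder labels:

```python
import numpy as np

def f1_score_binary(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 1 = responder, 0 = non-responder (toy labels)
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(round(f1_score_binary(y_true, y_pred), 3))  # 0.667
```

Because it balances precision and recall, F1 is less misleading than raw accuracy when responder and non-responder classes are imbalanced, as is common in drug response datasets.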

Workflow: primary and validation single-cell collections undergo quality control, then feed the foundation models (scFoundation, scGPT, and others). Each model is adapted via layer freezing or LoRA fine-tuning and assessed under both pooled-data and cross-data evaluation scenarios, with performance summarized by F1 score.

scDrugMap Benchmarking Workflow

Research Reagent Solutions

Table 3: Essential Research Resources for scDrugMap-Style Benchmarking

| Resource Category | Specific Examples | Function in Research |
| --- | --- | --- |
| Single-Cell Foundation Models | scFoundation, scGPT, UCE, scBERT, Geneformer, cellLM, cellPLM | Base pre-trained models for transfer learning and zero-shot evaluation |
| Large Language Models | LLaMa3-8B, GPT4o-mini | General-purpose models adaptable for biological sequence analysis |
| Training Adaptation Methods | Low-Rank Adaptation (LoRA), Layer Freezing | Parameter-efficient fine-tuning strategies for model specialization |
| Computational Frameworks | scDrugMap (Python CLI & Web Server), BioLLM | Standardized interfaces for model integration and evaluation |
| Benchmark Datasets | Primary Collection (326,751 cells), Validation Collection (18,856 cells) | Curated single-cell data with drug response annotations for training and testing |
| Evaluation Metrics | F1 Score, Pearson Correlation, Differential Expression Analysis | Quantitative performance assessment for model comparison |

The scDrugMap benchmarking framework provides comprehensive evidence that both scFoundation and scGPT offer distinct strengths for drug response prediction, with the optimal choice dependent on the specific research context and application requirements. scFoundation demonstrates superior performance in pooled-data scenarios where substantial training data is available, while scGPT excels in cross-data evaluation with stronger zero-shot generalization capabilities. However, independent studies consistently show that simpler models can sometimes outperform these foundation models, highlighting the importance of rigorous benchmarking against appropriate baselines. Researchers should select models based on their specific use case, data availability, and generalization requirements, while remaining critical of model claims and validating performance against simpler alternatives.

Predicting Cellular Responses to Genetic Perturbations with Perturb-Seq Data

The ability to accurately predict cellular responses to genetic perturbations is a cornerstone of functional genomics and therapeutic discovery. Technologies like Perturb-seq, which combines CRISPR-based interventions with single-cell RNA sequencing, have generated vast datasets detailing these responses. In response, the computational biology community has developed sophisticated "foundation" models, pre-trained on millions of single-cell transcriptomes, to tackle this prediction problem. Two prominent models, scGPT and scFoundation, have emerged as state-of-the-art candidates. However, rigorous and independent benchmarking is crucial to validate their performance claims. This guide synthesizes evidence from recent, comprehensive studies to objectively compare the predictive performance of these foundation models against each other and, importantly, against simpler baseline approaches. The overarching finding across multiple independent investigations is that despite their complexity and computational cost, these foundation models currently fail to consistently outperform deliberately simple baselines, highlighting significant challenges and opportunities for improvement in the field.

Performance Comparison: Foundation Models vs. Baseline Approaches

Recent benchmark studies have systematically evaluated scGPT and scFoundation against a range of simpler models on the task of predicting post-perturbation gene expression profiles. The consistent result is that foundation models are often outperformed by straightforward alternatives.

Table 1: Performance Comparison on Perturbation Prediction Tasks (Pearson Correlation in Differential Expression Space)

| Model / Dataset | Adamson et al. | Norman et al. | Replogle (K562) | Replogle (RPE1) |
| --- | --- | --- | --- | --- |
| Train Mean (Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest (GO Features) | 0.739 | 0.586 | 0.480 | 0.648 |

Data Source: [5]

The data reveals that the simplest baseline, which predicts the average expression profile from the training data ("Train Mean"), reliably outperforms both scGPT and scFoundation across multiple datasets [5]. Even more notably, a standard Random Forest model using Gene Ontology (GO) biological pathway features as input "outperformed foundation models by a large margin" [5]. This suggests that incorporating structured biological prior knowledge can be more effective than the representations learned through the foundation models' pre-training on vast amounts of single-cell data.

In a separate study published in Nature Methods, a simple additive model—which sums the individual logarithmic fold changes of single perturbations to predict the effect of a double perturbation—proved superior to five foundation models and two other deep learning approaches [14]. Furthermore, when tasked with predicting genetic interactions (where the effect of a double perturbation is non-additive), none of the deep learning models performed better than a "no change" baseline that always predicts the control condition [14].
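The additive baseline is simple enough to state in one line: the predicted log fold change (LFC) of a double perturbation is the sum of the two single-perturbation LFCs. The gene names and values below are toy illustrations.

```python
import numpy as np

def additive_baseline(lfc_a, lfc_b):
    """Predict a double perturbation's log fold change as the sum of
    the two single-perturbation log fold changes."""
    return lfc_a + lfc_b

# Toy log fold changes over 4 genes for two single perturbations
lfc_klf1 = np.array([1.0, -0.5, 0.0, 0.2])
lfc_gata1 = np.array([0.3, 0.1, -1.0, 0.0])
print(additive_baseline(lfc_klf1, lfc_gata1))  # elementwise sum
```

A genetic interaction is, by definition, the deviation of the observed double-perturbation response from this additive prediction, so a model that cannot beat this baseline is effectively failing to predict interactions at all.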

Detailed Experimental Protocols for Benchmarking

Understanding the methodology behind these benchmarks is critical for interpreting the results and for researchers aiming to conduct their own evaluations.

The benchmarks rely on publicly available Perturb-seq datasets, which use CRISPR to perturb genes and single-cell RNA sequencing to measure the transcriptional outcome. Key datasets include:

  • Adamson et al.: 68,603 single K562 cells with single-gene CRISPRi perturbations [5] [14].
  • Norman et al.: 91,205 single K562 cells with single and dual-gene CRISPRa (overexpression) perturbations [5] [14].
  • Replogle et al. (K562 & RPE1): Two datasets each with over 160,000 single cells from genome-wide CRISPRi screens in different cell lines [5] [14].

For evaluation, single-cell predictions are typically aggregated by perturbation to create pseudo-bulk expression profiles, which are compared to the ground truth pseudo-bulk profiles [5].

Evaluation Metrics and Task Formulation

The core evaluation metric is often the Pearson correlation, calculated in two key spaces:

  • Differential Expression Space (Pearson Delta): The correlation between the predicted and actual change in gene expression (perturbed vs. control). This is considered more informative than raw expression space, as it focuses on the perturbation-specific effect [5].
  • Performance on Top DE Genes: The correlation is also calculated specifically for the top 20 differentially expressed (DE) genes to assess the model's ability to capture the most significant changes [5].

The primary task is Perturbation Exclusive (PEX) prediction, where the model's ability to generalize to the effects of completely unseen perturbations or combinations is tested [5] [23].

Model Training and Fine-tuning

For the foundation models, the standard protocol involves taking a model that has been pre-trained on a large corpus of single-cell data (often >10 million cells) and then fine-tuning it on the specific perturbation dataset of interest. The baseline models, such as Random Forest or k-Nearest Neighbors, are trained directly on the perturbation data using features derived from biological databases or the foundation models' own gene embeddings [5].

Visualizing the Benchmarking Workflow

The following diagram illustrates the standard workflow for training and evaluating perturbation response prediction models, as used in the cited benchmarks.

Workflow: large-scale scRNA-seq atlas data (10M+ cells) is used to pretrain the foundation models (scGPT, scFoundation), which are then fine-tuned on a Perturb-seq dataset (e.g., Norman, Adamson) to yield a trained prediction model. Baseline models (mean, Random Forest) are instead trained directly on the same Perturb-seq data using biological features (GO, KEGG pathways). Both produce post-perturbation expression predictions that are evaluated against the experimental ground truth for performance comparison.

Diagram 1: Benchmarking Workflow for Perturbation Prediction Models. This workflow compares foundation models (fine-tuned on Perturb-seq data) against baseline models trained directly on the data with biological features. Performance is evaluated by comparing predictions to the experimental ground truth.

Successful perturbation modeling relies on a combination of computational tools and curated biological datasets. The table below details essential "research reagents" for this field.

Table 2: Essential Research Reagents for Perturbation Modeling

| Resource Name | Type | Primary Function in Perturbation Modeling |
| --- | --- | --- |
| Perturb-seq Datasets (Adamson, Norman, Replogle) | Experimental Data | Provides the ground truth data of gene expression responses to genetic perturbations, used for model training and benchmarking [5] [14]. |
| Gene Ontology (GO) / KEGG / Reactome | Biological Database | Curated knowledge bases of biological pathways and functions. Used as informative features for baseline machine learning models [5]. |
| CollecTRI | Biological Database | A comprehensive gene regulatory network resource. Used to evaluate the biological meaningfulness of learned gene embeddings [5]. |
| PerturBench | Computational Framework | A modular codebase for standardized development and evaluation of perturbation prediction models, ensuring fair comparisons [23]. |
| BioLLM | Computational Framework | A unified interface that integrates diverse single-cell foundation models (scGPT, Geneformer, scFoundation), streamlining their application and evaluation [17]. |

The independent benchmarking of scGPT and scFoundation reveals a critical and consistent finding: as of early 2025, these complex foundation models do not surpass the performance of simple baseline methods for predicting cellular responses to genetic perturbations. Models that predict the average training response or use off-the-shelf biological features in a Random Forest regressor consistently set a high bar. This does not negate the potential of the foundation model approach but underscores that the field is still in its early stages. Future progress will likely depend on improved model architectures, more effective pre-training strategies, and the development of benchmarking standards that more accurately reflect the complex biological reality of perturbation responses. For researchers and drug developers, the current evidence strongly suggests that simpler, interpretable models should be included as robust baselines in any project aiming to predict genetic perturbation effects.

The advent of single-cell foundation models (scFMs) represents a paradigm shift in computational biology, offering unprecedented opportunities to advance cancer drug response prediction (DRP). Models like scGPT and scFoundation, built on transformer architectures pretrained on millions of single cells, promise to capture universal biological principles that can be specialized for downstream tasks like DRP [1]. These models employ sophisticated tokenization strategies where genes become input tokens analogous to words in a sentence, with expression values providing additional context [4]. The fundamental premise is that exposure to diverse cellular states across tissues and conditions enables these models to learn generalized representations of cellular behavior that can enhance predictive accuracy for downstream applications such as the DeepCDR drug response prediction framework.

However, integrating these powerful models into existing DRP pipelines requires careful benchmarking to identify their relative strengths, limitations, and optimal implementation strategies. Recent comprehensive evaluations reveal a complex performance landscape where scFMs demonstrate significant potential but also notable limitations compared to simpler approaches [5] [14] [4]. This comparison guide provides an objective assessment of scGPT versus scFoundation performance to inform effective integration with DeepCDR frameworks, supported by experimental data and implementation protocols.

Performance Benchmarking: Quantitative Comparison

Perturbation Response Prediction

Accurately predicting cellular responses to genetic and chemical perturbations is fundamental to DRP. Benchmarking studies directly compared scGPT and scFoundation against baseline models for predicting transcriptome changes after single and double genetic perturbations using Perturb-seq datasets (Adamson, Norman, and Replogle) [5] [14].

Table 1: Performance Comparison in Perturbation Prediction (Pearson Delta Metric)

Model | Adamson Dataset | Norman Dataset | Replogle K562 | Replogle RPE1
Train Mean (Baseline) | 0.711 | 0.557 | 0.373 | 0.628
scGPT | 0.641 | 0.554 | 0.327 | 0.596
scFoundation | 0.552 | 0.459 | 0.269 | 0.471
Random Forest with GO | 0.739 | 0.586 | 0.480 | 0.648

Surprisingly, even simple baseline models like Train Mean (predicting the average of training examples) outperformed both foundation models across all datasets [5]. Similarly, a linear baseline model that sums individual logarithmic fold changes for double perturbations substantially outperformed scGPT, scFoundation, and other deep learning models [14]. Random Forest models incorporating biological prior knowledge (Gene Ontology features) achieved the best performance, surpassing scGPT by a large margin [5].
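The Pearson Delta metric and the Train Mean baseline are both simple enough to express in a few lines. The sketch below uses purely synthetic expression profiles; the metric follows the description in the text (correlation between predicted and observed changes relative to control), while the data shapes and noise levels are illustrative assumptions.

```python
import numpy as np

def pearson_delta(pred, obs, control):
    """Pearson correlation between predicted and observed expression
    *changes* relative to control (the 'Pearson Delta' metric)."""
    dp, do = pred - control, obs - control
    dp = dp - dp.mean()
    do = do - do.mean()
    return float(dp @ do / (np.linalg.norm(dp) * np.linalg.norm(do)))

rng = np.random.default_rng(1)
n_genes = 500
control = rng.normal(size=n_genes)

# Hypothetical pseudo-bulk profiles for 20 training perturbations.
train_profiles = control + rng.normal(scale=0.5, size=(20, n_genes))

# The 'Train Mean' baseline predicts the average training response
# for every held-out perturbation, regardless of which gene is perturbed.
train_mean_prediction = train_profiles.mean(axis=0)

observed = control + rng.normal(scale=0.5, size=n_genes)
score = pearson_delta(train_mean_prediction, observed, control)
print(round(score, 3))
```

On real Perturb-seq data this baseline scores well precisely because many perturbations induce broadly similar transcriptional shifts.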

Zero-Shot Capabilities and Generalizability

For practical implementation in DRP pipelines, zero-shot performance (without task-specific fine-tuning) is crucial for exploratory applications where labeled data is limited. Evaluation of zero-shot capabilities for cell type annotation and batch integration revealed important limitations:

Table 2: Zero-Shot Performance Across Biological Tasks

Model | Cell Type Annotation (AvgBIO) | Batch Integration | Biological Relevance
scGPT | Inconsistent; outperformed by scVI and Harmony on most datasets | Moderate technical batch correction; struggles with biological variation | Captures some biological pathways
Geneformer | Consistently outperformed by simple HVG selection | Poor performance; embeddings often dominated by batch effects | Limited biological relevance in embeddings
scFoundation | Not extensively evaluated in zero-shot | Not extensively evaluated in zero-shot | Gene embeddings show biological utility

In zero-shot cell type clustering, both scGPT and Geneformer were consistently outperformed by established methods like Harmony, scVI, and even simple highly variable genes (HVG) selection [15]. Notably, selecting HVGs achieved the best batch integration scores across all datasets, highlighting the performance gap for foundation models in zero-shot settings [15].

Experimental Protocols and Methodologies

Benchmarking Framework Design

To ensure reproducible evaluation of scFMs for DRP applications, researchers should implement standardized benchmarking protocols mirroring recent comprehensive studies:

Data Preparation and Partitioning:

  • Utilize standardized Perturb-seq datasets (Adamson, Norman, Replogle) for genetic perturbation prediction [5] [14]
  • Implement rigorous train-test splits focusing on perturbation-exclusive (PEX) settings where models predict effects of completely unseen perturbations [5]
  • For drug response prediction, employ GDSC and CCLE datasets with multiple splitting strategies (mask-cells, mask-drugs, mask-pairs) to assess generalizability [24]

Evaluation Metrics:

  • Primary: Pearson correlation in differential expression space (Pearson Delta) focusing on top differentially expressed genes [5]
  • Secondary: L2 distance for most highly expressed genes, genetic interaction detection capability [14]
  • Additional: Batch integration metrics (ASW, BIO scores) for zero-shot evaluation [15]
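The primary and secondary metrics above can be computed directly from predicted, observed, and control profiles. The sketch below restricts the correlation to the top 20 differentially expressed genes (ranked by observed absolute change) and computes the L2 distance on full profiles; all data are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes = 1000
control = rng.normal(size=n_genes)
observed = control + rng.normal(scale=0.3, size=n_genes)
predicted = control + rng.normal(scale=0.3, size=n_genes)

# Rank genes by observed absolute change from control and keep the
# top 20 differentially expressed genes, as in the secondary metric.
top = np.argsort(np.abs(observed - control))[-20:]

d_pred = (predicted - control)[top]
d_obs = (observed - control)[top]
r_top20 = np.corrcoef(d_pred, d_obs)[0, 1]

# L2 distance between full predicted and observed profiles.
l2 = np.linalg.norm(predicted - observed)
print(round(float(r_top20), 3), round(float(l2), 3))
```

Restricting evaluation to top DE genes avoids rewarding models that merely reproduce the unchanged bulk of the transcriptome.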

Baseline Models:

  • Include simple baselines (Train Mean, No Change, Additive Model) [5] [14]
  • Implement traditional machine learning models (Random Forest with biological features) [5]
  • Compare against specialized DRP models (GraphTCDR, SubCDR) [25] [26]

Model Integration Strategies

Feature Extraction Approach: Rather than using scFMs as end-to-end predictors, extract gene and cell embeddings as features for traditional machine learning models. Random Forest models using scGPT embeddings achieved better performance than fine-tuned scGPT itself, though they still underperformed models built on biological prior-knowledge features [5].
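A minimal sketch of this feature-extraction pattern, assuming hypothetical frozen 128-dimensional gene embeddings (random stand-ins here for embeddings exported from a pretrained model): the embedding of each perturbed gene becomes the input to a Random Forest that predicts the expression-change profile.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Hypothetical frozen 128-d gene embeddings from a pretrained model,
# used purely as input features (the foundation model itself is not tuned).
embeddings = {f"GENE{i}": rng.normal(size=128) for i in range(60)}

train_genes = list(embeddings)[:50]   # perturbed genes seen in training
test_genes = list(embeddings)[50:]    # held-out perturbations

X_train = np.stack([embeddings[g] for g in train_genes])
Y_train = rng.normal(size=(50, 300))  # pseudo-bulk expression changes

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, Y_train)

Y_pred = rf.predict(np.stack([embeddings[g] for g in test_genes]))
print(Y_pred.shape)  # (10, 300)
```

Swapping the embedding dictionary for GO-term vectors turns the same pipeline into the stronger prior-knowledge baseline discussed above.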

Hybrid Prediction Framework: Implement ensemble approaches combining scFM embeddings with biological knowledge features. Studies show that incorporating Gene Ontology vectors and pathway information significantly boosts prediction accuracy compared to using foundation model outputs alone [5] [4].

[Diagram: multi-omics input data (gene expression, mutations) feeds the scGPT and scFoundation foundation models; their embeddings, together with drug features (structures, targets) and biological prior knowledge (pathways, GO terms), are passed to traditional ML models (Random Forest, linear models), producing enhanced DeepCDR predictions and downstream biological interpretation.]

Diagram 1: Enhanced DeepCDR Integration Framework. This workflow combines foundation model embeddings with traditional machine learning and biological prior knowledge for improved drug response prediction.

Technical Specifications and Implementation

Architectural Comparison

Understanding the fundamental architectural differences between scGPT and scFoundation is essential for effective integration:

Table 3: Model Architectures and Training Specifications

Parameter | scGPT | scFoundation
Architecture | GPT-style decoder with unidirectional attention | BERT-style encoder with bidirectional attention
Parameters | ~50 million | ~100 million
Pretraining Data | 33 million non-cancerous human cells | 50 million single cells
Input Genes | 1,200 highly variable genes | 19,264 protein-encoding genes
Tokenization | Value binning combined with gene embeddings | Gene embeddings with value projection
Positional Encoding | Not used | Not used
Primary Pretraining Task | Iterative masked gene modeling with MSE loss | Read-depth-aware masked gene modeling

scGPT employs a GPT-style decoder architecture pretrained on 33 million non-cancerous human cells, using value binning for expression levels and focusing on highly variable genes [4] [1]. In contrast, scFoundation utilizes a BERT-style encoder trained on 50 million cells with nearly complete gene coverage, implementing read-depth-aware masking during pretraining [4].

Research Reagent Solutions

Table 4: Essential Research Resources for scFM Integration

Resource | Type | Function in DRP Research
GDSC Database | Drug screening dataset | Primary source of drug response data (IC50 values) for model training and validation
CCLE | Cell line database | Provides multi-omics profiles of cancer cell lines for feature generation
Perturb-seq Datasets | Genetic perturbation data | Enables model benchmarking for perturbation response prediction
PubChem | Chemical database | Source of drug molecular representations (fingerprints, SMILES strings)
Gene Ontology | Biological knowledge base | Provides prior knowledge features for enhancing prediction accuracy
BioLLM Framework | Software framework | Standardized APIs for integrating and evaluating multiple scFMs

Integration Recommendations for DeepCDR

Practical Implementation Guidelines

Based on comprehensive benchmarking evidence, the following integration approaches are recommended:

Prioritize Feature Extraction over End-to-End Learning: Instead of using scFMs as complete DRP solutions, extract their gene and cell embeddings as input features for established DeepCDR architectures. Experimental results demonstrate that Random Forest models using scGPT embeddings outperformed fine-tuned scGPT while maintaining computational efficiency [5].

Implement Ensemble Strategies: Combine foundation model outputs with biological prior knowledge. Studies consistently show that models incorporating Gene Ontology features and pathway information achieve superior performance compared to standalone scFM predictions [5] [25].

Leverage scGPT for Blood-Derived Cancers: scGPT demonstrates stronger performance on blood and bone marrow datasets compared to other tissue types, suggesting prioritized integration for hematological malignancies [15].

Utilize scFoundation for Comprehensive Gene Coverage: When full transcriptome analysis is required, scFoundation's coverage of 19,264 protein-encoding genes provides advantages over scGPT's highly-variable-gene approach [4].

Limitations and Alternative Approaches

Despite their theoretical promise, current scFMs show consistent limitations that warrant consideration:

Simplicity-Performance Paradox: Across multiple benchmarks, simple baseline models consistently matched or outperformed sophisticated foundation models. The "additive model" for genetic interactions and "train mean" for perturbation response provided competitive baselines [5] [14].

Specialized DRP Model Superiority: Models specifically designed for drug response prediction, such as GraphTCDR (utilizing heterogeneous graph neural networks) and SubCDR (employing subcomponent-guided deep learning), demonstrated superior performance compared to general-purpose scFMs [25] [26].

Computational Efficiency Trade-offs: The substantial computational resources required for scFM fine-tuning may not be justified given their current performance limitations, especially when simpler models achieve comparable or better results [14].

[Diagram: a model-selection decision flow over single-cell RNA-seq input that considers, in order, simple baselines (Train Mean, additive model), biological knowledge models (GO, pathways), specialized DRP models (GraphTCDR, SubCDR), and finally foundation models (scGPT, scFoundation) for specialized cases, all converging on an optimal DRP prediction.]

Diagram 2: Model Selection Framework for DRP. This decision flow prioritizes simpler, biologically-informed approaches based on benchmarking evidence, with foundation models reserved for specialized cases.

Integration of single-cell foundation models with DeepCDR frameworks offers promising avenues for enhancing cancer drug response prediction, but requires careful, evidence-based implementation. Current benchmarking reveals that while scGPT and scFoundation provide valuable biological representations, they rarely outperform simpler approaches as end-to-end solutions and show significant limitations in zero-shot settings.

For immediate DeepCDR enhancement, a hybrid approach leveraging scGPT embeddings as input features to traditional machine learning models, augmented with biological prior knowledge, represents the most promising integration path. This strategy combines the representation learning capabilities of foundation models with the proven predictive power of established DRP methodologies. As the scFM field rapidly evolves, continued rigorous benchmarking against simple baselines remains essential to distinguish genuine algorithmic advances from incremental improvements that fail to translate to practical predictive performance.

Within the rapidly evolving field of single-cell biology, foundation models pretrained on millions of cells promise to serve as versatile tools for a wide array of downstream tasks. The "pre-train then fine-tune" paradigm aims to capture universal patterns of gene regulation and cell behavior, which can then be efficiently adapted to specific applications. This guide provides an objective, data-driven comparison of two prominent foundation models—scGPT and scFoundation—focusing on their performance across three critical tasks: cell type annotation, batch correction, and gene network inference. The analysis is framed within the broader context of benchmarking studies that seek to evaluate whether these complex models deliver tangible advantages over simpler, more established computational methods. The findings summarized here are based on recent peer-reviewed literature and preprints that have conducted rigorous, multi-faceted benchmarks.

Performance Comparison Tables

The following tables summarize the quantitative performance of scGPT and scFoundation against various baseline models across different tasks. The data is aggregated from multiple large-scale benchmarking studies.

Table 1: Performance on Perturbation Effect Prediction (Differential Expression Space)

Model | Adamson Dataset (Pearson Delta) | Norman Dataset (Pearson Delta) | Replogle K562 (Pearson Delta) | Replogle RPE1 (Pearson Delta)
Train Mean (Simplest Baseline) | 0.711 | 0.557 | 0.373 | 0.628
scGPT | 0.641 | 0.554 | 0.327 | 0.596
scFoundation | 0.552 | 0.459 | 0.269 | 0.471
Random Forest with GO Features | 0.739 | 0.586 | 0.480 | 0.648

Table 2: Zero-Shot Performance on Cell-Level Tasks [4] [15]

Model | Cell Type Annotation (AvgBio Score) | Batch Integration (iLISI Score) | Computational Resources
scGPT | Variable; outperformed by scVI and Harmony on most datasets [15] | Good on complex datasets with biological batch effects [15] | 50 M parameters [4]
scFoundation | Information missing | Information missing | 100 M parameters [4]
Geneformer | Consistently outperformed by simpler baselines [15] | Poor; often worsened batch effects [15] | 40 M parameters [4]
Baseline: scVI / Harmony | Consistently high performance [15] | Consistently high performance [15] | Lower resource requirements

Table 3: Performance on Gene Network Inference [27]

Model Category | Representative Methods | Precision | Recall | Leverages Interventional Data
Observational Methods | PC, GES, NOTEARS | Low to Moderate | Low to Moderate | No
Interventional Methods | GIES, DCDI | Low to Moderate | Low to Moderate | Yes, but with limited benefit
Challenge-Winning Methods | Mean Difference, Guanlab | High | High | Yes
Foundation Models | scGPT, scFoundation | Not consistently top-ranked [27] | Not consistently top-ranked [27] | Information missing

Experimental Protocols

To ensure the reproducibility of the results and a fair understanding of the comparisons, this section outlines the key experimental methodologies shared across the cited benchmarking studies.

Benchmarking Perturbation Prediction

1. Datasets: Benchmarks primarily used Perturb-seq datasets, including:

  • Adamson et al.: 68,603 single cells with single-gene CRISPRi perturbations in K562 cells [5] [14].
  • Norman et al.: 91,205 single cells with single and combinatorial CRISPRa perturbations in K562 cells [5] [14].
  • Replogle et al. (K562 & RPE1): Two datasets with ~162,750 single cells each from a genome-wide CRISPRi screen in two different cell lines [5] [14].

2. Task Formulation: The core task was framed as a Perturbation Exclusive (PEX) prediction. Models were trained on a set of perturbations and then tested on their ability to predict the gene expression profile of held-out, unseen perturbations [5].
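A Perturbation Exclusive split can be implemented by holding out whole perturbations rather than random cells. The sketch below uses hypothetical knockout labels to show the splitting logic; dataset sizes and label names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical perturbation label for each of 200 measured profiles,
# drawn from 30 distinct gene knockouts.
genes = [f"G{i}" for i in range(30)]
perturbations = np.array([f"KO_{g}" for g in rng.choice(genes, size=200)])

# Perturbation-exclusive (PEX) split: hold out entire perturbations so
# the test set contains only perturbations never seen during training.
unique = np.unique(perturbations)
test_perts = set(rng.choice(unique, size=len(unique) // 5, replace=False))

test_mask = np.isin(perturbations, list(test_perts))
train_idx = np.where(~test_mask)[0]
test_idx = np.where(test_mask)[0]

# Sanity check: no perturbation appears on both sides of the split.
assert not set(perturbations[train_idx]) & set(perturbations[test_idx])
print(len(train_idx), len(test_idx))
```

A naive random split over cells would leak every perturbation into both partitions and inflate apparent generalization.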

3. Evaluation Metrics:

  • Primary Metric: Pearson correlation in the differential expression space (Pearson Delta), which compares the predicted change from control against the observed change. This is considered more meaningful than correlation in raw expression space [5] [14].
  • Secondary Metrics: Pearson correlation for the top 20 differentially expressed genes, and L2 distance between predicted and observed expression profiles [5] [14].

4. Baseline Models:

  • Simple Baselines: "No change" (predicts control profile) and "Additive" (for double perturbations, predicts the sum of single perturbation effects) [14].
  • Train Mean: Predicts the average expression profile of all training perturbations for every test case [5] [14].
  • Standard ML Models: Random Forest, k-Nearest Neighbors, and Elastic-Net models using prior biological knowledge features like Gene Ontology (GO) vectors or model-generated gene embeddings [5].
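The simple baselines listed above are a few lines each. This sketch, over synthetic pseudo-bulk profiles, shows the "no change" and "additive" baselines for a double perturbation, plus the deviation from additivity that serves as a simple genetic-interaction readout; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n_genes = 400
control = rng.normal(size=n_genes)

# Observed pseudo-bulk profiles for two single perturbations A and B.
profile_a = control + rng.normal(scale=0.4, size=n_genes)
profile_b = control + rng.normal(scale=0.4, size=n_genes)

# 'No change' baseline: always predict the control profile.
no_change = control.copy()

# 'Additive' baseline for the double perturbation A+B: control plus the
# sum of the two individual changes relative to control.
additive = control + (profile_a - control) + (profile_b - control)

# Deviation of an observed double-perturbation profile from the additive
# prediction is one simple readout of genetic interaction.
observed_double = additive + rng.normal(scale=0.2, size=n_genes)
interaction = observed_double - additive
print(additive.shape, interaction.shape)
```

That such baselines remain competitive is exactly the finding the benchmarks above report.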

Benchmarking Zero-Shot Cell-Level Tasks

1. Feature Extraction: Models were evaluated in a zero-shot setting. This means their pretrained weights were frozen, and cell (or gene) embeddings were extracted without any further task-specific fine-tuning. This tests the generalizable biological knowledge acquired during pretraining [4] [15].

2. Downstream Tasks & Evaluation:

  • Cell Type Annotation: The quality of cell embeddings was assessed by performing clustering and measuring the agreement with known cell type labels using metrics like Average BIO (AvgBio) score and Average Silhouette Width (ASW) [15].
  • Batch Integration: The ability of embeddings to mix cells from different technical batches while preserving biological separation was quantified using metrics like iLISI (integration Local Inverse Simpson's Index) and PCR (Principal Component Regression) batch [15].
  • Datasets: Benchmarks used multiple public datasets (e.g., Tabula Sapiens, Pancreas, PBMC) with high-quality labels to ensure robust evaluation [4] [15].
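AvgBio and iLISI have precise definitions in benchmarking packages such as scib; the sketch below shows only the silhouette-based intuition behind the ASW-style metrics, using a synthetic 2-D embedding with made-up cell-type and batch labels.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(6)

# Hypothetical 2-D embedding of 300 cells with cell-type and batch labels.
emb = rng.normal(size=(300, 2))
cell_type = rng.integers(0, 3, size=300)
batch = rng.integers(0, 2, size=300)

# Silhouette width on cell-type labels: higher = better biological separation.
asw_bio = silhouette_score(emb, cell_type)

# Silhouette width on batch labels: for good integration this should be low
# (batches well mixed); benchmarks often report a rescaled score such as
# 1 - |ASW_batch| so that higher is better.
asw_batch = silhouette_score(emb, batch)
batch_mixing = 1 - abs(asw_batch)
print(round(float(asw_bio), 3), round(float(batch_mixing), 3))
```

The same embedding is scored twice, once against biology and once against batch, which is why a model can look good on one axis and fail on the other.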

Benchmarking Gene Network Inference

1. Benchmark Suite: Evaluations were conducted using CausalBench, a suite designed for network inference on real-world, large-scale single-cell perturbation data [27].

2. Ground Truth Challenge: Since the true causal graph is unknown in biological systems, CausalBench uses biologically-motivated metrics and distribution-based interventional measures instead of a known graph [27].

3. Evaluation Metrics:

  • Biology-Driven Evaluation: Precision and recall of predicted gene-gene interactions against a consensus network built from biological prior knowledge [27].
  • Statistical Evaluation:
    • Mean Wasserstein Distance: Measures if predicted interactions correspond to strong causal effects.
    • False Omission Rate (FOR): Measures the rate at which true causal interactions are missed by the model [27].
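The Wasserstein-based criterion can be illustrated with one target gene: if a predicted regulator is genuinely causal, intervening on it should shift the target's expression distribution. This sketch uses synthetic normal distributions purely for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(7)

# Hypothetical expression of one target gene: under control conditions,
# after intervening on a true regulator, and after intervening on an
# unrelated gene. A strong causal link should shift the distribution.
control_expr = rng.normal(loc=2.0, size=500)
perturbed_expr = rng.normal(loc=3.5, size=500)
unrelated_expr = rng.normal(loc=2.0, size=500)

d_true = wasserstein_distance(control_expr, perturbed_expr)
d_null = wasserstein_distance(control_expr, unrelated_expr)
print(d_true > d_null)
```

Averaging such distances over all predicted edges yields the mean Wasserstein distance used as a distribution-based interventional measure.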

Model Architectures and Workflows

The following diagrams illustrate the core architectures of the foundation models and the workflow for a typical benchmarking study.

[Diagram: scGPT passes expression values for 1,200 HVGs through an embedding layer (gene ID + value binning) and a transformer encoder with attention masking, yielding a 512-dimensional cell embedding; scFoundation passes all ~19,264 protein-coding genes through an embedding layer (gene ID + value projection) and an asymmetric encoder-decoder, yielding a 3,072-dimensional cell state.]

Diagram 1: Model Architectures Comparison.

[Diagram: benchmark datasets (Perturb-seq perturbation data, atlas-level data such as CellxGene) define the evaluation tasks (perturbation prediction, cell type annotation, batch integration, network inference); model predictions or embeddings (zero-shot or fine-tuned) are compared against baselines (simple mean model, Random Forest with GO terms, standard methods like scVI and Harmony) using task-specific metrics, and the findings are synthesized.]

Diagram 2: Benchmarking Workflow.

The Scientist's Toolkit: Research Reagent Solutions

Table 4: Key Computational Tools and Datasets for Benchmarking

Tool / Dataset Name | Type | Primary Function in Benchmarking | Key Features / Notes
Perturb-seq Datasets (Adamson, Norman, Replogle) | Biological Dataset | Provides ground-truth data for training and evaluating perturbation prediction models. | Combines CRISPR perturbations with single-cell RNA-seq; enables causal inference [5] [14].
CausalBench | Software Benchmark Suite | Evaluates gene network inference methods on real-world interventional data. | Provides biologically-motivated metrics in the absence of a known ground-truth graph [27].
Gene Ontology (GO) | Knowledge Database | Provides biological prior knowledge features for baseline machine learning models. | A graph-based ontology of biological terms; used as feature vectors for genes [5].
Harmony | Computational Algorithm | A leading baseline method for batch integration of single-cell data. | Used as a performance benchmark for foundation models' batch correction capabilities [15].
scVI | Computational Algorithm | A generative deep learning model for single-cell data analysis. | Used as a performance benchmark for tasks like cell type annotation and batch integration [15].

Navigating Limitations and Enhancing Model Performance

The emergence of single-cell foundation models (scFMs) like scGPT and scFoundation represents a transformative development in computational biology, promising to leverage patterns learned from millions of cells to predict cellular behavior. These models are designed to capture universal principles of gene regulation and cell state dynamics, with the ultimate goal of accurately predicting cellular responses to genetic and chemical perturbations—a capability with profound implications for drug discovery and therapeutic development. However, rigorous independent benchmarking has revealed a surprising paradox: these sophisticated models frequently fail to outperform deliberately simple baseline methods in critical zero-shot learning scenarios, where models must make predictions without task-specific fine-tuning.

This performance gap exposes fundamental challenges in current approaches to model development and evaluation within the single-cell domain. Understanding the limitations of these models is not merely an academic exercise but a practical necessity for researchers and drug development professionals who rely on computational predictions to guide experimental design and resource allocation. This guide provides an objective comparison of scGPT and scFoundation against simpler alternatives, presenting comprehensive experimental data and methodologies to inform selection criteria for perturbation prediction tasks.

Experimental Frameworks for Zero-Shot Evaluation

Benchmarking Design Principles

Zero-shot evaluation assesses models on tasks they haven't been specifically fine-tuned for, testing their ability to generalize beyond their original training objectives [16]. This approach is particularly valuable for assessing foundation models because it mirrors real-world discovery settings where labeled data for specific perturbations may be unavailable [28]. Proper benchmark design must account for multiple generalization scenarios, including Perturbation Exclusive (PEX) settings where models predict effects of entirely novel perturbations, and Cell Exclusive (CEX) settings where models generalize to unseen cell types or contexts [5].

The most informative benchmarks incorporate multiple datasets with varying technical and biological complexities. For single-cell perturbation prediction, ideal benchmarks should include datasets generated using different experimental techniques (e.g., CRISPRi, CRISPRa), across multiple cell lines, and with both single and combinatorial perturbations [5] [14]. This diversity helps distinguish models that have learned fundamental biological principles from those that have merely memorized dataset-specific correlations.

Key Methodological Approaches

Table 1: Core Evaluation Methodologies for Perturbation Prediction

Method Category | Representative Examples | Key Characteristics | Primary Applications
Foundation Models | scGPT, scFoundation, Geneformer | Transformer architectures pre-trained on millions of cells; require fine-tuning or are used zero-shot | Cell type annotation, perturbation response prediction, batch integration
Baseline Models | Train Mean, Additive Model | Predict the average of training samples or the sum of individual effects | Simple benchmarks for model performance
Traditional ML | Random Forest, k-NN, ElasticNet | Use biological features (GO terms, embeddings); limited parameters | Perturbation prediction with biological priors
Linear Models | Matrix factorization approaches | Learn low-dimensional representations of genes and perturbations | Predicting effects of unseen perturbations

Independent evaluations have employed several consistent methodological approaches across studies. For perturbation prediction, models typically receive as input gene expression vectors from unperturbed cells along with a representation of the perturbation, then generate predicted post-perturbation expression profiles [5]. Predictions are made at single-cell level but are often aggregated to pseudo-bulk profiles for evaluation stability.

The most critical evaluation metric involves calculating Pearson correlations in the differential expression space (perturbed minus control expression) rather than raw expression space, as the latter tends to be dominated by baseline expression levels of highly expressed genes [5] [14]. Additional evaluation dimensions include performance on top differentially expressed genes, genetic interaction prediction (for combinatorial perturbations), and generalization to unseen cell types or conditions.
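Aggregating single-cell predictions to pseudo-bulk profiles before scoring is a one-liner per perturbation. The sketch below uses synthetic cells and made-up perturbation labels to show the grouping step described above.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical single-cell predictions: 30 cells for each of
# 3 perturbations, 100 genes per cell.
labels = np.repeat(["pertA", "pertB", "pertC"], 30)
cells = rng.normal(size=(90, 100))

# Aggregate single-cell predictions to pseudo-bulk (mean expression per
# perturbation) before computing evaluation metrics, for stability.
pseudobulk = {p: cells[labels == p].mean(axis=0) for p in np.unique(labels)}
print(sorted(pseudobulk), pseudobulk["pertA"].shape)
```

Metrics such as Pearson Delta are then computed between these averaged profiles rather than noisy individual cells.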

[Diagram: perturbation inputs and biological features flow into foundation models, simple baselines, and traditional ML models; all predictions are evaluated first in raw expression space, then in differential expression space, and finally on the top differentially expressed genes.]

Figure 1: Experimental workflow for benchmarking perturbation prediction models, showing input data types, model approaches, and evaluation strategies.

Comparative Performance Analysis

Perturbation Prediction Capabilities

Independent benchmarking studies have consistently demonstrated that simpler approaches frequently match or exceed the performance of foundation models in predicting transcriptional responses to genetic perturbations. In one comprehensive evaluation, the simplest baseline—predicting the mean of training examples (Train Mean)—outperformed both scGPT and scFoundation across four different Perturb-seq datasets when measuring Pearson correlation in differential expression space [5]. More notably, random forest regressors using Gene Ontology (GO) vectors as features substantially outperformed foundation models by a large margin (Pearson Delta: 0.739 vs. 0.641 for scGPT on the Adamson dataset) [5].

Table 2: Performance Comparison Across Perturbation Datasets (Pearson Delta Metric)

Model Adamson Dataset Norman Dataset Replogle K562 Replogle RPE1
Train Mean 0.711 0.557 0.373 0.628
scGPT 0.641 0.554 0.327 0.596
scFoundation 0.552 0.459 0.269 0.471
RF + GO Features 0.739 0.586 0.480 0.648
RF + scGPT Embeddings 0.727 0.583 0.421 0.635

For the challenging task of predicting double perturbation effects, foundation models have shown particular limitations. In evaluations using the Norman dataset (combinatorial CRISPRa perturbations), even the "no change" baseline—which always predicts expression identical to control conditions—outperformed specialized deep learning models including GEARS, scGPT, and scFoundation [14]. Furthermore, foundation models demonstrated poor capability in identifying genetic interactions, with most models predominantly predicting buffering interactions and rarely correctly predicting synergistic effects [14].

Zero-Shot Capabilities for Cell Type Identification and Batch Integration

Beyond perturbation prediction, zero-shot performance is crucial for exploratory tasks like cell type identification and batch integration, where predefined labels may be unavailable. Evaluations across multiple datasets reveal that both scGPT and Geneformer underperform established baselines in these settings [16]. In cell type clustering, selecting highly variable genes (HVG) consistently outperformed both foundation models across average BIO score and average silhouette width metrics [16]. Similarly, for batch integration tasks, foundation models generally failed to adequately correct for technical batch effects while preserving biological signal, with Harmony and scVI demonstrating superior performance [16].
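The HVG baseline that outperforms the foundation models here is deliberately simple. Real pipelines (e.g., scanpy's highly_variable_genes) normalize dispersion against mean expression; the sketch below is a plain variance ranking over a synthetic log-normalized matrix, intended only to show the idea.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical log-normalized expression matrix: 500 cells x 2000 genes.
X = rng.gamma(shape=1.0, scale=1.0, size=(500, 2000))

# Simple highly-variable-gene (HVG) baseline: rank genes by variance
# across cells and keep the top 200 as the cell representation.
variances = X.var(axis=0)
hvg_idx = np.argsort(variances)[-200:]
X_hvg = X[:, hvg_idx]
print(X_hvg.shape)  # (500, 200)
```

Clustering on this reduced matrix is the kind of low-cost representation that the cited benchmarks found hard to beat in zero-shot settings.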

The performance gap in batch integration is particularly striking. Visualization of embeddings from the Pancreas benchmark dataset (containing data from five different sources) revealed that while Geneformer and scGPT could integrate different experiments using the same technique, they generally failed to correct for batch effects between different techniques [16]. In these visualizations, the primary structure in foundation model embedding spaces was driven by batch effects rather than biologically meaningful categories.

Architectural and Methodological Limitations

Fundamental Constraints in Current Approaches

The consistent underperformance of foundation models relative to simpler alternatives suggests systemic limitations in current approaches rather than implementation-specific issues. Several interconnected factors contribute to this performance gap:

  • Pretraining-finetuning mismatch: The masked language modeling objective used during pretraining may not optimally prepare models for perturbation prediction tasks [16]. While this approach effectively teaches models gene-gene correlations in baseline states, it provides limited guidance for predicting dynamic responses to interventions.

  • Low perturbation-specific variance: Commonly used benchmark datasets exhibit limited perturbation-specific signal relative to technical and biological noise [5]. This low signal-to-noise ratio makes it difficult for complex models to distinguish meaningful patterns from stochastic variation.

  • Inefficient knowledge extraction: Despite extensive pretraining, foundation models may not effectively distill biologically meaningful representations. This is evidenced by the superior performance of random forest models using foundation model embeddings compared to the end-to-end fine-tuned models themselves [5].

[Diagram: Low Perturbation Signal → Simple Models Outperform; Pretraining-Task Mismatch → Zero-Shot Performance Gap; Poor Batch Effect Correction → Zero-Shot Performance Gap; Inefficient Knowledge Extraction → Limited Generalization]

Figure 2: Key limitations of current single-cell foundation models that contribute to their underperformance relative to simpler approaches.

Emerging Solutions and Alternative Approaches

Recent research has begun addressing these limitations through innovative architectural and methodological improvements:

  • Efficient fine-tuning techniques: Approaches like the single-cell Drug-Conditional Adapter (scDCA) enable parameter-efficient fine-tuning by training less than 1% of original foundation model parameters while incorporating information from novel modalities (e.g., chemical structures) [29]. This preserves rich biological representations learned during pretraining while adapting to specific prediction tasks.

  • Enhanced benchmarking frameworks: Unified evaluation frameworks like BioLLM provide standardized APIs for consistent model comparison across diverse tasks, revealing distinct performance trade-offs across different scFM architectures [17]. Such frameworks enable more rigorous and reproducible model assessment.

  • Knowledge-enhanced representations: Incorporating structured biological knowledge through knowledge graphs has shown promise in other zero-shot learning domains [30], suggesting potential pathways for improving single-cell foundation models through explicit integration of pathway and regulatory network information.

Practical Implementation Guide

Research Reagent Solutions

Table 3: Essential Computational Tools for Perturbation Modeling

| Tool | Type | Primary Function | Considerations |
| --- | --- | --- | --- |
| scGPT | Foundation Model | Multi-task single-cell analysis | Strong overall performer; benefits from extensive pretraining |
| scFoundation | Foundation Model | Genetic perturbation prediction | Requires specific gene sets; limited flexibility |
| Geneformer | Foundation Model | Context-aware predictions | Limited zero-shot capabilities |
| Harmony | Batch Integration | Multi-dataset integration | Superior technical effect correction |
| scVI | Probabilistic Model | Dimensionality reduction, integration | Effective biological preservation |
| GEARS | Specialized Model | Genetic perturbation prediction | Utilizes prior knowledge of gene interactions |

For researchers seeking to implement perturbation prediction capabilities, evidence suggests the following strategic approaches:

  • For predicting novel genetic perturbations: Begin with random forest models using Gene Ontology features or pre-trained gene embeddings, as these consistently outperform more complex alternatives while offering greater computational efficiency and interpretability [5].

  • For zero-shot cell type identification: Prioritize established methods like Harmony or scVI over foundation models, as both demonstrate superior batch correction and cell type separation without requiring fine-tuning [16].

  • For predicting chemical perturbation effects: Consider efficient fine-tuning approaches like scDCA when foundation models must be employed, as these preserve pretrained knowledge while adapting to novel modalities with minimal parameter updates [29].

  • For benchmarking new models: Implement simple baselines (Train Mean, additive models) as essential reference points, as these provide critical context for evaluating whether model complexity translates to meaningful performance improvements [5] [14].
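To make the first recommendation concrete, here is a minimal sketch of a random-forest baseline trained on Gene Ontology features, in the spirit of the RF + GO approach from [5]. All data here is synthetic and the binary GO-membership features are hypothetical placeholders; it assumes scikit-learn and numpy are available.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_train_perts, n_go_terms, n_genes = 40, 50, 200

# Hypothetical features: one binary GO-term membership vector per perturbed gene.
X_train = rng.integers(0, 2, size=(n_train_perts, n_go_terms)).astype(float)
# Targets: pseudo-bulk differential expression profiles (perturbed minus control).
y_train = rng.normal(size=(n_train_perts, n_genes))

# Multi-output random forest: one model predicts the full expression shift.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Predict the expression shift for an unseen perturbation from its GO vector alone.
X_test = rng.integers(0, 2, size=(1, n_go_terms)).astype(float)
delta_pred = rf.predict(X_test)
```

Because the features describe the perturbed gene rather than the training cell population, this baseline can generalize to perturbations never seen during training, which is the PEX setting the benchmarks test.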

The consistent pattern of simple baselines matching or exceeding foundation model performance in zero-shot perturbation prediction represents both a challenge and opportunity for the single-cell biology community. Rather than invalidating the foundation model approach, these findings highlight the immaturity of current methodologies and the need for more biologically-grounded architectures, improved pretraining strategies, and more rigorous evaluation practices.

The most promising research directions include developing pretraining objectives that better capture causal relationships rather than correlations, incorporating explicit biological knowledge through structured data sources, and creating more challenging benchmarks with higher perturbation-specific signal. Additionally, parameter-efficient fine-tuning techniques that preserve foundational knowledge while adapting to specific tasks represent a practical path forward for applying these models to real-world discovery settings.

For researchers and drug development professionals, the current evidence suggests a cautious approach to adopting foundation models for critical perturbation prediction tasks. While their theoretical potential remains substantial, practical implementations should prioritize robust benchmarking against simple alternatives and selective application to tasks where they demonstrate clear, measurable advantages over more straightforward approaches.

The development of single-cell foundation models (scFMs) like scGPT and scFoundation represents a transformative advance in computational biology, promising to predict cellular responses to genetic and chemical perturbations with high accuracy [1]. These transformer-based models are pre-trained on millions of single-cell transcriptomes to learn fundamental principles of gene regulation and signaling, then fine-tuned for specific prediction tasks [12]. However, rigorous benchmarking studies have revealed surprising limitations in current evaluation paradigms, particularly stemming from low perturbation-specific variance in commonly used benchmark datasets [5]. This challenge fundamentally undermines our ability to accurately assess model performance and compare competing approaches.

The core issue identified in recent research is that standard perturbation datasets exhibit minimal variance that can be specifically attributed to the perturbations themselves, as opposed to general biological variation or technical noise [5]. When the signal of interest is weak relative to background variation, even simple baseline models can appear to perform comparably to sophisticated foundation models, making meaningful comparison difficult. This problem is compounded by the predominance of perturbation-exclusive (PEX) benchmarking setups that test only a model's ability to generalize to novel perturbations in familiar cell types, rather than evaluating performance across diverse cellular contexts [5]. Understanding and addressing this low-variance challenge is crucial for advancing the field of predictive cellular modeling.

Experimental Benchmarking Protocols

Standardized Evaluation Framework

To ensure fair comparison across models, researchers have established comprehensive benchmarking protocols that test performance across multiple dimensions. The most rigorous evaluations employ several key Perturb-seq datasets covering different perturbation types and cell lines [5]:

  • Adamson dataset: 68,603 single cells subjected to single perturbation CRISPR interference (CRISPRi)
  • Norman dataset: 91,205 single cells with single or dual CRISPRa (overexpression) perturbations
  • Replogle datasets: Two subsets (K562 and RPE1 cell lines) containing approximately 162,750 single cells each from genome-wide CRISPRi screens

The evaluation methodology follows a standardized workflow to ensure reproducible and comparable results across models [5]. Predictions are generated at the single-cell level, then aggregated to form pseudo-bulk expression profiles for each perturbation. These predicted profiles are compared against ground truth data using multiple correlation metrics:

  • Raw expression space: Pearson correlation of complete gene expression profiles
  • Differential expression space: Pearson correlation of perturbation-induced expression changes (Δ)
  • Top DE genes: Focused evaluation on the 20 most differentially expressed genes
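The distinction between raw and differential expression space can be sketched in a few lines of numpy with synthetic profiles (illustrative only; not the benchmark's actual data). A prediction with no perturbation-specific signal still correlates strongly with ground truth in raw space, because both share the dominant baseline expression, while its Δ-space correlation collapses toward zero.

```python
import numpy as np

def pearson_delta(pred, truth, control):
    """Pearson correlation in differential expression space (delta = profile - control)."""
    return np.corrcoef(pred - control, truth - control)[0, 1]

rng = np.random.default_rng(1)
control = rng.normal(size=500)                     # pseudo-bulk control profile
truth = control + rng.normal(scale=0.1, size=500)  # weak true perturbation effect
pred = control + rng.normal(scale=0.1, size=500)   # prediction with no real signal

raw_r = np.corrcoef(pred, truth)[0, 1]     # high: dominated by shared baseline
delta_r = pearson_delta(pred, truth, control)  # near zero: no perturbation signal
```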

[Diagram: single-cell predictions → pseudo-bulk aggregation → model prediction compared against ground truth via raw expression correlation, Δ expression correlation, and top DE genes]

Comparative Model Performance

Surprisingly, benchmarking results have demonstrated that even simple baseline models can outperform sophisticated foundation models on standard perturbation prediction tasks. The performance gap becomes particularly evident when evaluating in differential expression space, which more specifically captures perturbation effects [5].

Table 1: Performance Comparison Across Models (Pearson Δ Correlation)

| Model | Adamson | Norman | Replogle K562 | Replogle RPE1 |
| --- | --- | --- | --- | --- |
| Train Mean | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| RF + GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| RF + scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |

The "Train Mean" baseline simply predicts the average expression profile from training examples for all perturbations, yet it consistently outperforms both scGPT and scFoundation across most datasets [5]. Even more strikingly, random forest models using biologically informed features (Gene Ontology vectors) substantially outperform all foundation models, suggesting that current scFMs may not be effectively leveraging their pretrained biological knowledge for perturbation prediction tasks.
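A minimal sketch makes clear just how trivial the Train Mean baseline is (synthetic pseudo-bulk profiles; the perturbation label shown is a hypothetical example):

```python
import numpy as np

rng = np.random.default_rng(2)
n_perts, n_genes = 30, 100

# Pseudo-bulk expression profiles, one row per training perturbation.
train_profiles = rng.normal(size=(n_perts, n_genes))

# The Train Mean baseline: one fixed profile, predicted for every perturbation.
train_mean = train_profiles.mean(axis=0)

def predict(perturbation_name):
    # The prediction ignores the perturbation identity entirely.
    return train_mean
```

That a model which never looks at the perturbation identity can beat fine-tuned foundation models is the clearest symptom of the low perturbation-specific variance discussed below.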

The Low Variance Challenge in Perturbation Datasets

Quantifying Dataset Limitations

The underlying issue with current benchmarking approaches stems from the low perturbation-specific variance in commonly used datasets. When the expression changes induced by perturbations are minimal compared to background biological variation and technical noise, models struggle to identify true signal, and benchmarking becomes unreliable [5].

Table 2: Characteristics of Perturbation Benchmark Datasets

| Dataset | Cell Count | Perturbation Type | Perturbation Variance | Primary Limitation |
| --- | --- | --- | --- | --- |
| Adamson | 68,603 | CRISPRi (single) | Low | Minimal expression changes |
| Norman | 91,205 | CRISPRa (single/dual) | Low-Medium | Combinatorial complexity |
| Replogle K562 | ~162,750 | CRISPRi (genome-wide) | Low | High background noise |
| Replogle RPE1 | ~162,750 | CRISPRi (genome-wide) | Low | Cell-type specific effects |

The fundamental problem is that these datasets were primarily designed to detect differentially expressed genes rather than to train complex predictive models. Consequently, the effect sizes for most perturbations are quite small, with only subtle changes to the transcriptional landscape [5]. This creates a scenario where models that simply learn to predict average expression patterns can appear deceptively competent, as they minimize overall error without truly capturing perturbation-specific effects.

Impact on Model Assessment

The low variance problem manifests in several specific challenges for benchmarking:

  • Signal masking: Weak perturbation signals are obscured by technical noise and biological variation, making it difficult for models to learn true causal relationships.
  • Metric inflation: High correlation scores in raw expression space create a false impression of model capability, as these metrics primarily reflect accurate prediction of baseline expression rather than perturbation effects.
  • Generalization failure: Models that appear to perform well on standard benchmarks may fail dramatically when applied to datasets with stronger perturbation effects or different cellular contexts.

Recent studies have demonstrated that the low variance issue is particularly problematic for transformer-based foundation models, which may require larger effect sizes to effectively leverage their attention mechanisms and capture meaningful gene-gene interactions [5]. Simpler models based on biological priors (like GO term embeddings) appear somewhat more robust to this challenge, potentially because they incorporate external knowledge that helps distinguish signal from noise.

Advanced Benchmarking Strategies

Improved Evaluation Metrics

To address the limitations of standard correlation-based metrics, researchers have developed more sophisticated evaluation approaches that specifically target perturbation effects:

  • Differential expression precision: Measuring how well models identify truly differentially expressed genes rather than just matching expression magnitudes
  • Pathway-level consistency: Evaluating whether predicted expression changes align with known biological pathways and processes
  • Zero-shot generalization: Testing model performance without fine-tuning to assess foundational biological knowledge [16]

The zero-shot evaluation approach has been particularly revealing, demonstrating that both scGPT and Geneformer underperform simpler methods like highly variable gene selection or established integration tools like Harmony and scVI when applied without task-specific fine-tuning [16]. This suggests that current pretraining objectives may not effectively capture transferable biological principles.

Dataset Enhancement Approaches

Addressing the low variance challenge requires both improved datasets and more sophisticated analytical approaches:

[Diagram: Low-Variance Data, improved by Enhanced Protocols and mitigated by Multi-scale Metrics, both feeding into Robust Benchmarks]

  • Stronger perturbations: Incorporating datasets with more dramatic transcriptional changes, such as strong cytokine stimulations or transcription factor activations
  • Multi-modal integration: Combining transcriptomic data with epigenetic, proteomic, or spatial information to provide additional biological context
  • Time-series designs: Capturing dynamic responses to perturbations rather than just endpoint measurements
  • Cross-cell-type evaluations: Systematically testing model generalization across diverse cellular contexts rather than just within a single cell line

Recent model development has begun to address these challenges. For instance, CellFM—trained on 100 million human cells with 800 million parameters—shows improved performance in gene function prediction and cell annotation tasks, though its perturbation prediction capabilities still require comprehensive evaluation [12].

Research Reagent Solutions

Table 3: Essential Computational Tools for Perturbation Modeling

| Resource | Type | Primary Function | Application in Benchmarking |
| --- | --- | --- | --- |
| scGPT | Foundation Model | Gene expression prediction | Baseline for transformer-based approaches [5] |
| scFoundation | Foundation Model | Masked autoencoding of expression | Comparison model for benchmarking [5] |
| CellFM | Large-scale Foundation Model | Multi-task single-cell analysis | Emerging model with 100M-cell training [12] |
| Geneformer | Foundation Model | Rank-based gene modeling | Zero-shot performance evaluation [16] |
| Harmony | Integration Tool | Batch effect correction | Baseline for dataset integration [16] |
| scVI | Probabilistic Model | Dimensionality reduction | Reference for clustering performance [16] |
| CELLxGENE | Data Platform | Curated single-cell data | Source of standardized training data [1] |
| Perturb-seq | Technology | CRISPR screening + scRNA-seq | Primary data generation method [5] |

The benchmarking challenges posed by low-variance perturbation datasets represent a critical obstacle for advancing single-cell foundation models. Current evidence suggests that sophisticated transformer-based models like scGPT and scFoundation may not be effectively leveraging their architectural advantages for perturbation prediction, as simpler approaches consistently outperform them on standard benchmarks [5]. This performance gap appears to stem from both dataset limitations and potential shortcomings in model pretraining objectives.

Moving forward, the field requires several key advances: (1) development of higher-quality benchmarking datasets with stronger perturbation effects and richer biological contexts; (2) more sophisticated evaluation metrics that specifically assess perturbation-specific prediction rather than overall expression matching; and (3) improved model architectures and pretraining strategies that better capture causal biological relationships. The recent emergence of even larger models like CellFM trained on 100 million cells suggests that scaling alone may not address these fundamental challenges [12]. Instead, more targeted approaches combining biological prior knowledge with flexible deep learning architectures may be necessary to truly advance the state of the art in perturbation modeling.

As benchmarking methodologies continue to evolve, researchers should prioritize comprehensive evaluation across multiple biological contexts, careful examination of zero-shot capabilities, and rigorous comparison against simple but biologically informed baseline models [16]. Only through such rigorous approaches can we develop foundation models that genuinely advance our ability to predict and understand cellular responses to perturbation.

In the rapidly evolving field of single-cell genomics, foundation models like scGPT and scFoundation represent a significant leap forward, leveraging transformer architectures to interpret cellular "language" [1]. These models are pre-trained on millions of single-cell transcriptomes, learning fundamental principles of gene regulation and cell state that can be adapted to various downstream tasks through fine-tuning [1]. The core challenge, however, lies in effectively adapting these massive models to specific biological questions—such as predicting cellular responses to genetic perturbations—without requiring prohibitive computational resources or falling prey to overfitting on limited experimental data.

Recent comprehensive benchmarking studies have revealed surprising limitations in these foundation models. When evaluated for predicting post-perturbation gene expression profiles, even the simplest baseline models—such as predicting the mean expression from training examples—frequently outperformed sophisticated foundation models like scGPT and scFoundation [5] [14]. These findings highlight the critical importance of selecting appropriate optimization strategies when adapting pre-trained models, making the comparison between full fine-tuning, layer freezing, and Low-Rank Adaptation (LoRA) not merely technical but essential for advancing biological discovery.

Performance Benchmarking: scGPT vs. scFoundation

Key Findings from Perturbation Prediction Studies

Independent benchmarking studies have systematically evaluated scGPT and scFoundation against deliberately simple baselines for predicting transcriptome changes after genetic perturbations. The results have been sobering for proponents of large foundation models. Across multiple Perturb-seq datasets—including studies by Adamson, Norman, and Replogle—foundation models generally underperformed compared to a simple baseline that predicts the mean of training samples (Train Mean) [5]. Furthermore, standard machine learning models like Random Forest regressors, when equipped with biologically meaningful features such as Gene Ontology (GO) vectors, outperformed foundation models by a large margin [5].

A study published in Nature Methods (2025) reached similar conclusions, finding that no deep learning model could consistently outperform simple linear baselines or the mean prediction for forecasting the effects of unseen single or double perturbations [14]. This research also discovered that using the gene embeddings learned by scGPT and scFoundation within a simple linear model often achieved better performance than the fine-tuned foundation models themselves, suggesting that the pretrained representations contain valuable information that may be lost or poorly utilized during full fine-tuning [14].

Quantitative Performance Comparison

Table 1: Benchmarking Results on Perturbation Prediction Tasks (Pearson Correlation in Differential Expression Space)

| Model / Dataset | Adamson | Norman | Replogle K562 | Replogle RPE1 |
| --- | --- | --- | --- | --- |
| Train Mean | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| RF with GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| RF with scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |

Source: Adapted from BMC Genomics [5]

Table 2: Genetic Interaction Prediction Performance (Area Under Curve)

| Model | Performance (AUC) |
| --- | --- |
| No Change Baseline | 0.72 |
| Additive Model | 0.50 |
| GEARS | 0.67 |
| scGPT | 0.68 |
| scFoundation | 0.65 |
| Geneformer* | 0.58 |
| UCE* | 0.62 |

Source: Adapted from Nature Methods [14]. *Models not designed for this task, used with linear decoder.

Computational Cost Analysis

The computational resources required for fine-tuning these foundation models are substantial, yet this investment does not correlate with performance on perturbation prediction tasks. scFoundation, trained on approximately 50 million human cells with ~0.1 billion parameters, and scGPT, trained on over 33 million human cells, both require significant GPU memory and training time for full fine-tuning [31]. One benchmarking study noted that despite these substantial computational investments, the foundation models were consistently outperformed by simpler, more efficient approaches [14].

Optimization Strategies: A Technical Deep Dive

Full Fine-Tuning

Methodology: Full fine-tuning involves continuing the training process of all layers and parameters in a pre-trained model on a new, task-specific dataset. The entire weight matrix (W₀) is updated to W = W₀ + ΔW through backpropagation [32].

Advantages and Disadvantages: This approach typically provides the highest baseline accuracy and task performance since the model can fully adjust all parameters to the new data [33] [34]. However, it demands enormous computational resources—often requiring multi-GPU setups (A100/H100) and substantial training time [35] [33]. Full fine-tuning also risks catastrophic forgetting, where the model over-specializes to the fine-tuned task and loses general knowledge acquired during pre-training [35].

Applications in Single-Cell Analysis: In the context of scGPT and scFoundation, full fine-tuning would theoretically allow the model to completely adapt its understanding of gene-gene relationships to specific perturbation contexts. However, given the limited size of most perturbation datasets (often with only hundreds of perturbations), this approach is prone to overfitting, potentially explaining the poor benchmarking performance observed in recent studies [5] [14].

Layer Freezing (Partial Fine-Tuning)

Methodology: Layer freezing, a form of specification-based parameter-efficient fine-tuning, involves fine-tuning only a subset of the model's layers while keeping the majority frozen [36]. For example, researchers might freeze the earlier layers of scGPT that capture general gene relationships while unfreezing and fine-tuning only the final layers for task-specific adaptation.

Advantages and Disadvantages: This approach significantly reduces computational requirements and mitigates catastrophic forgetting by preserving the foundational knowledge in frozen layers [36]. The tradeoff is potentially lower task performance compared to full fine-tuning, as the model has limited adaptability. A critical consideration is determining which layers to freeze—a decision that often requires domain expertise and experimentation [36].

Evidence from Single-Cell Research: Studies evaluating parameter-efficient methods for pre-trained models in annotating scRNA-seq data have found that freezing layers tuning (FL) can achieve performance comparable to vanilla fine-tuning while dramatically reducing tunable parameters [36]. When applied to scBERT (a transformer model for single-cell data), layer freezing maintained strong performance on cell type annotation tasks while offering significant efficiency gains [36].

Low-Rank Adaptation (LoRA)

Methodology: LoRA is a reparameterization-based PEFT method that freezes the pre-trained model weights and injects trainable rank decomposition matrices into transformer layers [32] [36]. Instead of updating the entire weight matrix ΔW, LoRA approximates it with the product of two smaller matrices ΔW = BA, where B and A have much lower dimensions [32]. For a layer with d×d parameters, LoRA reduces trainable parameters to 2×d×r, where r is the rank (typically 4-64) [32].
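The parameter arithmetic above is easy to verify directly. Using a hypothetical layer width of d = 1024 and rank r = 8 (values chosen for illustration, not taken from any specific model):

```python
d, r = 1024, 8  # layer width and LoRA rank (illustrative values)

full_params = d * d      # parameters updated by full fine-tuning: the full ΔW
lora_params = 2 * d * r  # parameters in the low-rank factors B (d x r) and A (r x d)

savings_pct = 100 * lora_params / full_params
print(full_params, lora_params, round(savings_pct, 2))  # 1048576 16384 1.56
```

At rank 8, the adapter trains roughly 1.6% of the parameters that full fine-tuning would update for this layer, which is where the memory and storage savings described below come from.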

Advantages and Disadvantages:

  • Reduced Memory Requirements: Cuts GPU memory needs dramatically, often from hundreds of GB to just a few GB [32]
  • Higher Training Speed: Fewer parameters lead to faster training cycles [33]
  • Zero Inference Latency: Adapter weights can be merged with base weights after training [32]
  • Task Switching: Multiple lightweight adapters can be created for different tasks [32]
  • Storage Efficiency: Only need to store small adapter weights (MBs) rather than full models (GBs) [32]

The primary challenge lies in balancing rank and performance—lower ranks save resources but may not capture task complexity [32].

Applications in Single-Cell Foundation Models: The recently introduced CellFM model incorporates LoRA modules to reduce trainable parameters during fine-tuning for new datasets [31]. This approach demonstrates how LoRA can enable efficient adaptation of large single-cell foundation models (800M parameters in CellFM) without compromising performance across diverse applications like cell annotation and perturbation prediction [31].

Table 3: Comparison of Fine-Tuning Strategies for Single-Cell Foundation Models

| Feature | Full Fine-Tuning | Layer Freezing | LoRA |
| --- | --- | --- | --- |
| Trainable Parameters | 100% | 1-20% | 1-5% |
| GPU Memory Requirements | Very High | Moderate | Low |
| Task Performance | Highest (theoretical) | Moderate | Near-full |
| Risk of Overfitting | High | Moderate | Low |
| Training Speed | Slow | Moderate | Fast |
| Inference Overhead | None | None | None (when merged) |
| Multiple Task Support | Poor (separate model per task) | Moderate | Excellent (adapter swapping) |

Experimental Protocols for Method Evaluation

Benchmarking Framework for Perturbation Prediction

Dataset Preparation: The standard benchmarking protocol utilizes multiple Perturb-seq datasets to evaluate generalization capabilities [5] [14]:

  • Adamson Dataset: 68,603 single cells with single perturbation CRISPRi
  • Norman Dataset: 91,205 single cells with single or dual CRISPRa perturbations
  • Replogle Dataset: Two subsets (K562 and RPE1 cell lines) with ~162,000 single cells each from genome-wide CRISPRi screens

Evaluation Methodology:

  • Implement Perturbation Exclusive (PEX) evaluation—assessing model ability to handle unseen perturbations
  • Generate predictions at single-cell level, then average to pseudo-bulk expression profiles
  • Compare predictions to ground truth using:
    • Pearson correlation in raw gene expression space
    • Pearson correlation in differential expression space (perturbed minus control)
    • Performance on top 20 differentially expressed genes

Baseline Models: Include simple baselines like Train Mean (average of training pseudo-bulk profiles) and Random Forest regressors with biological features (GO vectors, gene embeddings) [5].
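The perturbation-exclusive split at the heart of this protocol can be sketched in pure Python (the perturbation labels here are hypothetical placeholders): whole perturbations, not individual cells, are held out, so the test set contains only perturbations never seen during training.

```python
import random

perturbations = [f"gene_{i}+ctrl" for i in range(100)]  # hypothetical labels

# PEX split: shuffle perturbation identities, then hold out 20% of them entirely.
random.seed(0)
random.shuffle(perturbations)
split = int(0.8 * len(perturbations))
train_perts, test_perts = perturbations[:split], perturbations[split:]

# No perturbation appears on both sides of the split.
assert not set(train_perts) & set(test_perts)
```

Splitting at the cell level instead would leak perturbation identity into training and inflate every metric, so this step is essential for a fair evaluation.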

Implementation Protocols for Optimization Strategies

Full Fine-Tuning Protocol:

  • Initialize with pre-trained scGPT or scFoundation weights
  • Continue training on target perturbation dataset with all parameters unfrozen
  • Use learning rate of 1e-5 to 5e-5 (lower than pre-training)
  • Train until validation performance plateaus (typically 10-50 epochs)
  • Evaluate on held-out test perturbations [5]

Layer Freezing Protocol:

  • Load pre-trained model and freeze specific layers (typically early layers)
  • Unfreeze and fine-tune only final transformer layers and prediction head
  • Use moderate learning rate (5e-5 to 1e-4)
  • Monitor performance to determine optimal layer freezing strategy [36]
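The update rule behind this protocol can be shown framework-agnostically with a hypothetical two-layer model in plain numpy: gradients may exist everywhere, but only unfrozen layers receive parameter updates. In PyTorch the same effect is achieved by setting `requires_grad=False` on the frozen parameters.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy two-layer "model"; in practice these would be transformer blocks.
layers = {
    "early": {"W": rng.normal(size=(4, 4)), "frozen": True},
    "final": {"W": rng.normal(size=(4, 4)), "frozen": False},
}

def sgd_step(layers, grads, lr=0.01):
    for name, layer in layers.items():
        if layer["frozen"]:
            continue  # frozen layers keep their pretrained weights
        layer["W"] -= lr * grads[name]

before = {name: layer["W"].copy() for name, layer in layers.items()}
grads = {name: np.ones((4, 4)) for name in layers}
sgd_step(layers, grads)

print(np.allclose(layers["early"]["W"], before["early"]))  # True  (unchanged)
print(np.allclose(layers["final"]["W"], before["final"]))  # False (updated)
```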

LoRA Implementation Protocol:

  • Select target layers for LoRA injection (attention and MLP layers)
  • Set LoRA rank r (typically 8-16 for initial experiments)
  • Initialize LoRA matrices A with small random weights and B with zeros
  • Train with learning rate 5e-4 to 1e-3 (10× higher than full fine-tuning)
  • Keep base model frozen, only update LoRA parameters
  • Merge LoRA weights with base model for inference [32] [31]
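Two properties of this protocol, the zero initialization of B and the final merge, can be checked with a small numpy sketch (synthetic weights; illustrative only): because ΔW = BA is zero at initialization, the adapted layer starts out identical to the pretrained layer, and after training the adapter folds into the base weights with no inference overhead.

```python
import numpy as np

rng = np.random.default_rng(4)
d, r = 64, 8

W0 = rng.normal(size=(d, d))            # frozen pretrained weight matrix
A = rng.normal(scale=0.01, size=(r, d))  # A: small random initialization
B = np.zeros((d, r))                     # B: zeros, so delta_W = B @ A = 0 at start

# At initialization the adapted layer is exactly the pretrained layer.
assert np.allclose(W0 + B @ A, W0)

# Stand-in for a trained B; merging folds the adapter into the base weights.
B = rng.normal(scale=0.01, size=(d, r))
W_merged = W0 + B @ A

# The merged matrix reproduces base-plus-adapter outputs in a single matmul.
x = rng.normal(size=d)
assert np.allclose(W_merged @ x, W0 @ x + B @ (A @ x))
```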

Visualization of Methodologies and Workflows

Optimization Strategy Architectures

[Diagram, three panels. Full Fine-Tuning: input data → foundation model (all parameters updated) → task-specific predictions, with the backward pass sending gradients to every layer. Layer Freezing: input data → frozen layers (no updates) → trainable layers (parameter updates) → task-specific predictions, with the backward pass reaching only the trainable layers. LoRA: input data → frozen base model (no updates) → LoRA adapters (low-rank matrices) → task-specific predictions, with the backward pass reaching only the adapters]

Diagram 1: Architectural comparison of the three optimization strategies, showing parameter update patterns and data flow during fine-tuning.

Benchmarking Workflow for scFMs

[Diagram: pre-trained models (scGPT, scFoundation) and perturbation datasets (Adamson, Norman, Replogle) feed into a fine-tuning process governed by one of three optimization strategies (full fine-tuning, layer freezing, LoRA); the fine-tuned models undergo evaluation in the PEX setting, are compared against simple baselines, and yield performance metrics (Pearson Δ, L2 distance)]

Diagram 2: Benchmarking workflow for evaluating optimization strategies on single-cell foundation models using perturbation prediction tasks.

The Scientist's Toolkit: Essential Research Reagents

Table 4: Key Computational Tools and Resources for Single-Cell Foundation Model Research

| Tool/Resource | Type | Function | Relevance to Optimization Strategies |
| --- | --- | --- | --- |
| Hugging Face PEFT | Software Library | Parameter-Efficient Fine-Tuning | Implements LoRA and Adapter methods for transformer models |
| scGPT Framework | Model Architecture | Single-Cell Foundation Model | Target for optimization strategies; provides pre-trained weights |
| scFoundation Model | Model Architecture | Single-Cell Foundation Model | Comparison model for benchmarking studies |
| Perturb-seq Datasets | Experimental Data | Benchmark Validation | Adamson, Norman, and Replogle datasets for evaluating perturbation prediction |
| Gene Ontology (GO) Vectors | Biological Prior Knowledge | Feature Representation | Biological features for baseline models; enhances interpretability |
| MindSpore/PyTorch | AI Framework | Model Training & Inference | Computational backbone for implementing optimization strategies |
| CellFM | Integrated Framework | Large-scale scFM with LoRA | Example of LoRA integration in a production-scale model [31] |

The benchmarking evidence clearly indicates that despite their theoretical promise, single-cell foundation models like scGPT and scFoundation do not currently outperform simple baselines for perturbation prediction tasks [5] [14]. This surprising finding underscores the importance of rigorous evaluation and suggests that model size and pre-training scale alone are insufficient for mastering cellular response prediction.

Based on the comprehensive analysis of optimization strategies, we recommend:

  • Start with Simple Baselines: Before investing in foundation model fine-tuning, establish performance baselines using Random Forest models with biological features like GO terms or pre-computed gene embeddings [5].

  • Prioritize LoRA for Foundation Model Adaptation: When fine-tuning scGPT or similar models, LoRA provides the best balance of efficiency and performance, achieving near-full fine-tuning results with dramatically reduced resources [32] [31].

  • Use Layer Freezing for Transfer Learning: When adapting foundation models to conceptually similar tasks (e.g., different cell types), layer freezing offers a practical middle ground with reduced overfitting risk [36].

  • Reserve Full Fine-Tuning for Data-Rich Scenarios: Only consider full fine-tuning when you have large, high-quality perturbation datasets and ample computational resources—and even then, temper performance expectations based on current benchmarking results [5] [14].

The field of single-cell foundation models remains young, and current limitations in perturbation prediction likely reflect both methodological challenges and the inherent complexity of biological systems. As model architectures, training strategies, and optimization techniques continue to mature, the careful application of these adaptation strategies will be crucial for translating computational advances into biological insights.

The emergence of single-cell foundation models (scFMs), such as scGPT and scFoundation, has heralded a new era in computational biology, promising to decode the complex language of cellular processes from vast single-cell RNA sequencing (scRNA-seq) datasets. These models, often built on transformer architectures, are pre-trained on millions of cells to learn fundamental representations of gene regulation and cell states, which can then be fine-tuned for specific downstream tasks like perturbation prediction and cell type annotation [1]. However, recent rigorous benchmarking studies have revealed a critical insight: while these models learn powerful embeddings, their standalone performance in specific tasks, such as predicting gene perturbation effects, often fails to surpass deliberately simple baselines [5] [14]. This surprising finding has directed attention toward a more promising approach—strategically combining the latent representations learned by foundation models with structured biological prior knowledge. This guide provides a comprehensive comparison of this hybrid methodology, evaluating its performance against standalone models and detailing the experimental protocols that enable researchers to effectively leverage these integrated approaches for enhanced biological discovery.

Performance Benchmarking: Standalone Models vs. Hybrid Approaches

Limitations of Standalone Foundation Models

Recent comprehensive benchmarks have consistently demonstrated that scFMs, including scGPT and scFoundation, frequently underperform simple baseline models in critical prediction tasks. One landmark study found that even the simplest baseline—predicting the mean expression profile from training data—outperformed both scGPT and scFoundation in predicting post-perturbation gene expression profiles across four different Perturb-seq datasets [5]. Similarly, a benchmark published in Nature Methods concluded that "deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines," with none of the five evaluated foundation models surpassing a straightforward additive model [14].

The performance gap is particularly evident in differential expression prediction. In the Adamson dataset, the Train Mean baseline achieved a Pearson Delta correlation of 0.711, outperforming scGPT (0.641) and scFoundation (0.552). This pattern persisted across datasets, with Random Forest regression using Gene Ontology (GO) features substantially outperforming both foundation models (0.739 vs. 0.641 and 0.552, respectively, on the Adamson dataset) [5]. These results indicate that the current pretraining paradigms for scFMs may not be effectively capturing the causal relationships necessary for accurate perturbation response prediction.
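The Train Mean baseline is trivial to reproduce, which makes its competitiveness all the more striking. A minimal NumPy sketch, using made-up toy pseudo-bulk profiles rather than any real Perturb-seq data:

```python
import numpy as np

# Toy pseudo-bulk matrix: rows are training perturbations, columns are
# genes. The numbers are illustrative, not from any real dataset.
train_profiles = np.array([
    [2.0, 0.5, 1.0],
    [1.0, 1.5, 3.0],
    [3.0, 1.0, 2.0],
])

def train_mean_baseline(train_profiles: np.ndarray) -> np.ndarray:
    """Predict every held-out perturbation as the mean training profile."""
    return train_profiles.mean(axis=0)

prediction = train_mean_baseline(train_profiles)
print(prediction)
```

Because the same vector is returned for every held-out perturbation, any model that fails to beat this baseline has effectively learned nothing perturbation-specific.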

The Emergence of Hybrid Strategies

In response to these limitations, researchers have developed hybrid approaches that combine foundation model embeddings with biological prior knowledge. This integration has demonstrated remarkable success, often bridging the performance gap between standalone foundation models and simple baselines. When scGPT's embeddings were used as features in a Random Forest model instead of being used in the fine-tuned scGPT model itself, performance improved significantly (Pearson Delta: 0.727 vs. 0.641 on the Adamson dataset), though it still trailed Random Forest with GO features (0.739) [5].

Another compelling approach utilizes natural language processing-based gene embeddings from scELMO, which incorporates textual descriptions of genes generated by large language models. Random Forest models using scELMO features achieved competitive performance (0.706 on Adamson) comparable to GO-based models [5]. This suggests that textual biological knowledge can serve as an effective prior when combined with statistical learning methods.

Table 1: Performance Comparison of Prediction Models Across Perturbation Datasets (Pearson Delta Metric)

| Model Category | Specific Model | Adamson | Norman | Replogle K562 | Replogle RPE1 |
| --- | --- | --- | --- | --- | --- |
| Simple Baselines | Train Mean | 0.711 | 0.557 | 0.373 | 0.628 |
| Foundation Models | scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| Foundation Models | scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Biological Prior Models | RF + GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| Hybrid Models | RF + scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |
| Hybrid Models | RF + scELMO Embeddings | 0.706 | 0.663 | 0.471 | 0.651 |

Task-Dependent Performance Considerations

The relative performance of different modeling approaches varies significantly across biological tasks. While foundation models may struggle with perturbation prediction, they excel in other domains. For drug response prediction, scFoundation achieved the highest mean F1 scores of 0.971 and 0.947 using layer-freezing and fine-tuning strategies, respectively, outperforming the lowest-performing model by over 50% in pooled-data evaluation [22] [11]. In cross-data evaluation for drug response, UCE performed best after fine-tuning (mean F1: 0.774), while scGPT demonstrated superior performance in zero-shot learning (mean F1: 0.858) [22] [11].

This task-dependent performance emphasizes that no single model consistently outperforms others across all scenarios. A comprehensive benchmark of six scFMs confirmed this finding, revealing that model performance must be evaluated in the context of specific applications, with different models excelling in tasks such as cell type annotation, batch integration, and drug sensitivity prediction [4] [3].

Table 2: Model Performance Across Different Biological Tasks

| Task Category | Best Performing Model | Key Metric | Performance | Key Insight |
| --- | --- | --- | --- | --- |
| Perturbation Prediction | RF + GO Features | Pearson Delta | 0.739 (Adamson) | Biological priors outperform foundation models |
| Drug Response (Pooled-data) | scFoundation | F1 Score | 0.971 | Foundation models excel with sufficient data |
| Drug Response (Cross-data) | UCE (fine-tuned) | F1 Score | 0.774 | Fine-tuning enhances cross-dataset generalization |
| Drug Response (Zero-shot) | scGPT | F1 Score | 0.858 | Strong zero-shot transfer learning capability |
| Cell Type Annotation | scGraphformer | Accuracy | Superior across 20 datasets | Hybrid architecture captures cell-cell relationships |

Experimental Protocols for Hybrid Representation Learning

Embedding Extraction from Foundation Models

The first critical step in creating hybrid models is extracting meaningful embeddings from pre-trained foundation models. For scGPT, which uses a transformer architecture with a GPT-based decoder, gene embeddings can be extracted from the input embedding layer [1] [14]. These embeddings typically have a dimensionality of 512 and are designed to capture contextual relationships between genes based on the model's pre-training on millions of single-cell transcriptomes [5].

For scFoundation, which employs an asymmetric encoder-decoder architecture, gene embeddings of dimension 768 can be extracted [4]. The pre-training process for scFoundation uses a read-depth-aware masked gene modeling objective with mean squared error loss, which encourages the model to learn biologically meaningful representations of genes that capture their functional relationships [4] [14].

The extraction protocol typically involves:

  • Loading the pre-trained model weights without the final prediction head
  • Passing normalized gene expression data through the model
  • Extracting the hidden representations before the final layer
  • Aggregating these representations across appropriate dimensions to obtain per-gene or per-cell embeddings
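In NumPy terms, the aggregation step looks like the sketch below. The hidden-state tensor here is random stand-in data; a real pipeline would obtain it from scGPT (512-dimensional) or scFoundation (768-dimensional) after removing the prediction head:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the last hidden layer of a pre-trained model (before the
# prediction head): shape (n_cells, n_genes, embed_dim).
n_cells, n_genes, embed_dim = 4, 6, 512
hidden_states = rng.normal(size=(n_cells, n_genes, embed_dim))

# Per-cell embeddings: aggregate the representation over the gene axis.
cell_embeddings = hidden_states.mean(axis=1)  # (n_cells, embed_dim)

# Per-gene embeddings: aggregate over the cell axis.
gene_embeddings = hidden_states.mean(axis=0)  # (n_genes, embed_dim)

print(cell_embeddings.shape, gene_embeddings.shape)
```

Mean pooling is the simplest aggregation choice; attention-weighted pooling or the model's own CLS-style summary token are common alternatives.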

Integration with Biological Knowledge Graphs

The true power of hybrid approaches emerges when foundation model embeddings are integrated with structured biological knowledge. Gene Ontology (GO) provides a comprehensive computational model of biological systems, capturing functional relationships between genes across three domains: biological process, molecular function, and cellular component [5].

The integration protocol typically involves:

  • Feature Concatenation: Directly concatenating foundation model embeddings with GO term vectors
  • Graph Neural Networks: Constructing knowledge graphs where nodes represent genes with foundation model embeddings as features, and edges represent functional relationships from GO
  • Attention Mechanisms: Using attention layers to dynamically weight the importance of different knowledge sources

A successful implementation of this approach used Random Forest regression with GO features, which substantially outperformed standalone foundation models across multiple perturbation datasets [5]. The model took as input GO vectors representing the perturbed genes and achieved a Pearson Delta correlation of 0.739 on the Adamson dataset, compared to 0.641 for scGPT.
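A minimal sketch of the feature-concatenation strategy with synthetic data is shown below. The benchmarked hybrid used Random Forest regression; a plain least-squares fit stands in here to keep the example dependency-free, and all dimensions are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_perturbations, n_genes = 8, 5
go_dim, emb_dim = 10, 16

# Hypothetical per-perturbation features: a binary GO-term membership
# vector plus a foundation-model embedding of the perturbed gene.
go_features = rng.integers(0, 2, size=(n_perturbations, go_dim)).astype(float)
fm_embeddings = rng.normal(size=(n_perturbations, emb_dim))

# Feature concatenation: the simplest hybrid-integration strategy.
X = np.hstack([go_features, fm_embeddings])      # (8, 26)
y = rng.normal(size=(n_perturbations, n_genes))  # pseudo-bulk targets

# Least-squares regressor as a stand-in for the benchmarked Random Forest.
W, *_ = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ W
print(X.shape, predictions.shape)
```

The key point is that the downstream learner sees both knowledge sources in a single feature vector; swapping the regressor for a Random Forest changes one line in practice.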

Evaluation Frameworks and Metrics

Robust evaluation is essential for comparing hybrid approaches against standalone models. The benchmarking pipeline should include:

Data Splitting Strategies:

  • Perturbation Exclusive (PEX): Evaluating generalization to unseen perturbations
  • Cell Exclusive (CEX): Evaluating generalization to unseen cell types
  • Combo-Sequence Split: For combinatorial perturbations, holding out specific combinations
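A perturbation-exclusive split reduces to partitioning the perturbation labels before any cells are assigned; the labels below are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
perturbations = np.array(["KO_A", "KO_B", "KO_C", "KO_D", "KO_E"])

# Perturbation Exclusive (PEX): held-out perturbations never appear in
# training, so evaluation measures generalization to unseen targets.
shuffled = rng.permutation(perturbations)
train_perts, test_perts = shuffled[:3], shuffled[3:]

# No perturbation may leak across the split.
assert not set(train_perts) & set(test_perts)
print(len(train_perts), len(test_perts))
```

A CEX split follows the same pattern with cell-type labels in place of perturbation labels.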

Performance Metrics:

  • Pearson Correlation: Both in raw expression space and differential expression space
  • L2 Distance: For expression value prediction accuracy
  • F1 Score: For classification tasks like drug response prediction
  • Biological Consistency Metrics: scGraph-OntoRWR and Lowest Common Ancestor Distance (LCAD) to measure biological relevance of predictions [4] [3]

Baseline Models:

  • Simple mean prediction from training data
  • Additive models for combinatorial perturbations
  • Traditional machine learning models (ElasticNet, kNN, Random Forest) with biological features
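Pearson Delta, the headline metric in the perturbation benchmarks above, correlates predicted and observed profiles after subtracting the control mean, so that trivially predicting baseline expression earns no credit. A self-contained sketch with toy values:

```python
import numpy as np

def pearson_delta(pred, true, control):
    """Pearson correlation in differential-expression space: both
    profiles are expressed as deltas from the control mean profile."""
    d_pred = pred - control
    d_true = true - control
    d_pred = d_pred - d_pred.mean()
    d_true = d_true - d_true.mean()
    denom = np.sqrt((d_pred ** 2).sum() * (d_true ** 2).sum())
    return float((d_pred * d_true).sum() / denom)

# Toy profiles over four genes (illustrative values only).
control = np.array([1.0, 1.0, 1.0, 1.0])
true_post = np.array([2.0, 0.5, 1.5, 1.0])
pred_post = np.array([1.8, 0.6, 1.4, 1.1])
print(round(pearson_delta(pred_post, true_post, control), 3))
```

A perfect prediction scores 1.0; a prediction equal to the control profile is undefined (zero variance in delta space), which is exactly why mean baselines are evaluated on pseudo-bulk training means rather than controls.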

Visualization of Workflows and Relationships

Hybrid Model Architecture Diagram

[Diagram omitted: scRNA-seq data feeds a foundation model (scGPT, scFoundation) that produces gene embeddings, while biological knowledge (GO, pathways) is encoded via knowledge-graph embedding into structured biological features; the two streams merge through feature concatenation or graph integration into a hybrid representation, which drives a regression/classification prediction head to yield the final prediction.]

Benchmarking Evaluation Workflow

[Diagram omitted: benchmark datasets (Perturb-seq, drug response) feed three model families, i.e. standalone foundation models (scGPT, scFoundation), simple baselines (mean, additive), and hybrid models (foundation embeddings plus biological knowledge); each family is evaluated on perturbation prediction, drug response prediction, and cell type annotation using performance metrics (Pearson Delta, F1, L2) and biological relevance metrics (scGraph-OntoRWR, LCAD), feeding a comparative performance analysis that yields task-specific model recommendations.]

Essential Research Reagents and Computational Tools

Table 3: Key Research Reagents and Computational Tools for Hybrid Representation Learning

| Category | Resource Name | Specifications/Features | Primary Application |
| --- | --- | --- | --- |
| Foundation Models | scGPT | 50M parameters, 512D embeddings, GPT-based decoder | Gene embedding extraction, perturbation prediction |
| Foundation Models | scFoundation | 100M parameters, 768D embeddings, encoder-decoder | Large-scale representation learning |
| Biological Knowledge Bases | Gene Ontology (GO) | Functional term relationships across three domains | Biological prior feature engineering |
| Biological Knowledge Bases | KEGG Pathways | Curated pathway maps and functional hierarchies | Pathway-aware model integration |
| Biological Knowledge Bases | REACTOME | Detailed curated biological pathway database | Biological validation and interpretation |
| Benchmark Datasets | Perturb-seq Datasets | Adamson, Norman, Replogle (K562, RPE1) | Perturbation prediction benchmarking |
| Benchmark Datasets | Drug Response Collections | scDrugMap (326,751 cells, 36 datasets) | Drug sensitivity prediction evaluation |
| Computational Frameworks | scDrugMap | Python CLI and web server for drug response | End-to-end model evaluation platform |
| Computational Frameworks | scGraphformer | Transformer-GNN hybrid architecture | Cell type annotation and relationship learning |
| Evaluation Metrics | scGraph-OntoRWR | Cell ontology-informed consistency metric | Biological relevance assessment |
| Evaluation Metrics | LCAD (Lowest Common Ancestor Distance) | Ontological proximity for misclassification errors | Biological meaningfulness of errors |

The integration of foundation model embeddings with biological prior knowledge represents a powerful paradigm for enhancing representation learning in computational biology. While standalone foundation models like scGPT and scFoundation have demonstrated impressive capabilities in certain domains, particularly drug response prediction with sufficient data, their performance in critical tasks like perturbation prediction often trails simpler approaches that explicitly incorporate biological knowledge. The hybrid methodologies detailed in this guide, which combine the latent representations learned by foundation models with structured biological knowledge from sources like Gene Ontology, consistently outperform the standalone foundation models and, on several datasets, match or exceed purely knowledge-based baselines.

This comparative analysis reveals that the future of biological representation learning lies not in increasingly larger foundation models alone, but in the thoughtful integration of these models with the rich structured knowledge accumulated through decades of biological research. As the field advances, the most impactful approaches will likely be those that can most effectively bridge the gap between data-driven representation learning and established biological principles, creating models that are both statistically powerful and biologically meaningful.

Mitigating Batch Effects and Technical Variability in Model Embeddings

The emergence of single-cell foundation models (scFMs) like scGPT and scFoundation promises to revolutionize biological discovery by providing a unified framework for analyzing cellular transcriptomes. A core claim of these models is that their embeddings—internal representations of cellular states—can effectively separate biological signals from non-biological noise, a capability paramount for robust single-cell analysis [13]. Technical variability, or "batch effects," introduced by different sequencing protocols, laboratories, or experimental conditions, poses a significant obstacle to this goal. If not properly corrected, these artifacts can obscure true biological differences, leading to misleading conclusions in downstream analyses [37]. Therefore, the ability of a model to generate embeddings that are invariant to technical confounders while preserving biological heterogeneity is a critical benchmark for its utility in real-world research and drug development. This guide objectively compares the performance of scGPT and scFoundation in mitigating batch effects, synthesizing the latest experimental data to inform their practical application.

Performance Comparison: scGPT vs. scFoundation

Recent independent benchmarking studies have rigorously evaluated scGPT and scFoundation against simpler models and each other. The results reveal distinct performance profiles, particularly in perturbation prediction and batch integration tasks. The table below summarizes key quantitative findings.

Table 1: Performance Comparison of scGPT and scFoundation on Key Benchmarks

| Model | Primary Architecture | Perturbation Prediction (Pearson Delta, mean across datasets) | Batch Integration (iLISI, example datasets) | Zero-shot Cell Type Clustering (AvgBIO) | Key Strengths | Key Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| scGPT | Transformer-based (value categorization) | 0.530 (Adamson, Norman, Replogle K562 & RPE1) [5] | Outperforms scVI on complex batch effects (Tabula Sapiens) [16] | Inconsistent; outperformed by HVG and scVI on most datasets [16] | Robust performance across multiple tasks; can handle complex technical and biological batch effects [16] [17] | Underperforms simple baselines in perturbation prediction; inconsistent zero-shot clustering [5] [16] |
| scFoundation | Transformer-based (masked autoencoder) | 0.438 (Adamson, Norman, Replogle K562 & RPE1) [5] | Not detailed in the benchmark results | Not detailed in the benchmark results | Provides biologically meaningful gene embeddings [5] [17] | Underperforms simple baselines in perturbation prediction; requires specific gene sets, limiting applicability [5] [14] |
| Simple Baseline (Train Mean) | N/A | 0.567 (Adamson, Norman, Replogle K562 & RPE1) [5] | N/A | N/A | Surprisingly strong baseline for perturbation prediction [5] [14] | Incapable of capturing complex biological interactions [14] |
| Random Forest + GO Features | Ensemble learning | 0.613 (Adamson, Norman, Replogle K562 & RPE1) [5] | N/A | N/A | Outperforms foundation models by a large margin in perturbation prediction [5] | Relies on prior biological knowledge (GO terms) |

A critical insight from these benchmarks is that even simple models can rival or exceed the performance of large foundation models in specific tasks like perturbation prediction. For instance, a baseline that simply predicts the mean expression from the training data outperformed both scGPT and scFoundation across several datasets [5] [14]. Furthermore, a Random Forest model using Gene Ontology (GO) features "outperformed foundation models by a large margin" [5]. This suggests that the current general-purpose representations learned by scFMs may not yet be superior to task-specific models that incorporate curated biological knowledge.

Table 2: Analysis of Model Embeddings for Downstream Tasks

| Embedding Type | Source Model | Utility in Downstream Prediction | Biological Meaningfulness |
| --- | --- | --- | --- |
| Gene Embeddings | scGPT | Effective when used in a simple linear model, outperforming scGPT's own fine-tuned decoder [14] | Captures some gene-gene relationships [5] |
| Gene Embeddings | scFoundation | Effective when used in a simple linear model [14] | Provides biologically meaningful features [17] |
| Gene Embeddings | scELMO | Similar performance to GO-based Random Forest models [5] | Derived from LLM-generated gene descriptions |
| Perturbation Embeddings | GEARS | Enables linear models to perform on par with the original model [14] | Encodes perturbation relationships |

Experimental Protocols for Benchmarking

To ensure reproducibility and critical evaluation, understanding the standard protocols for benchmarking batch effect correction and embedding quality is essential. The following workflow outlines a typical evaluation pipeline.

[Diagram omitted: raw scRNA-seq datasets undergo data preprocessing (QC, normalization, HVG selection), then integration with a method such as scGPT, scFoundation, scVI, or Harmony to obtain cell embeddings; the embeddings are evaluated for batch mixing and biological preservation, and the two evaluations feed a comparison of model performance.]

Data Preparation and Benchmarking Setups

Benchmarks typically use publicly available scRNA-seq datasets with known, pronounced batch effects, such as the Pancreas dataset, which combines data from five different sources [16]. A critical step is pseudo-bulk creation, where gene expression profiles for each perturbation or condition are averaged to form a more stable profile for comparison [5]. Evaluation is often conducted under two main setups:

  • Perturbation Exclusive (PEX): Assesses the model's ability to generalize to unseen perturbations in a familiar cell type [5].
  • Cell Exclusive (CEX): Assesses generalization to unseen cell types for known perturbations.
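Pseudo-bulk creation reduces to a group-wise average over cells sharing a condition label; a minimal sketch with invented labels and values:

```python
import numpy as np

# Toy single-cell matrix: rows = cells, columns = genes; labels give
# each cell's perturbation. All values are illustrative.
expression = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
])
perturbation = np.array(["KO_A", "KO_A", "KO_B", "KO_B"])

def pseudo_bulk(expr: np.ndarray, labels: np.ndarray) -> dict:
    """Average expression over all cells sharing a perturbation label."""
    return {p: expr[labels == p].mean(axis=0) for p in np.unique(labels)}

profiles = pseudo_bulk(expression, perturbation)
print(profiles["KO_A"], profiles["KO_B"])
```

Averaging suppresses per-cell technical noise, yielding one stable profile per perturbation for downstream comparison.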

Key Evaluation Metrics

The performance of embedding models is quantified using metrics that balance batch correction with biological fidelity.

Table 3: Key Metrics for Evaluating Embedding Quality

| Metric | What It Measures | Interpretation |
| --- | --- | --- |
| iLISI (Graph Integration Local Inverse Simpson's Index) | Batch mixing in local cell neighborhoods [37] | Higher scores indicate better batch integration. |
| NMI (Normalized Mutual Information) | Preservation of cell type identity after integration [37] | Higher scores indicate better preservation of biological signal. |
| Pearson Delta | Correlation between predicted and actual differential expression profiles [5] | Higher scores indicate more accurate perturbation prediction. |
| ASW (Average Silhouette Width) & AvgBIO | Cell type separation and clustering accuracy [16] | Higher scores indicate better separation of distinct cell types. |

Visualizing the Batch Effect Challenge

The fundamental challenge in batch correction is to remove technical artifacts without erasing meaningful biological variation. The following diagram illustrates this problem and the desired outcome.

[Diagram omitted: in the "Problem: Batch Effects" panel, cells of types A and B from Batch 1 and Batch 2 cluster by batch of origin; in the "Ideal Integration Outcome" panel, cells cluster by cell type regardless of batch.]

The Scientist's Toolkit: Essential Research Reagents & Materials

Successful benchmarking and application of scFMs require a suite of computational tools and data resources.

Table 4: Key Resources for scFM Research

| Resource Name | Type | Primary Function | Relevance to Benchmarking |
| --- | --- | --- | --- |
| BioLLM Framework | Software Framework | Provides unified APIs for applying and evaluating different scFMs [17] | Standardizes model access and evaluation, enabling fair comparisons |
| Perturb-seq Datasets | Data | Ground-truth gene expression before/after genetic perturbation (e.g., Adamson, Norman, Replogle) [5] [14] | Essential for evaluating model performance on perturbation prediction tasks |
| CZ CELLxGENE | Data Platform | A curated corpus of standardized single-cell datasets [13] | Source of diverse, high-quality data for model pretraining and testing |
| Harmony & scVI | Software Tools | Established methods for data integration and embedding generation [16] | Critical baseline models for benchmarking the performance of newer scFMs |
| Gene Ontology (GO) | Knowledge Base | A structured repository of gene function annotations [5] | Used to create biologically meaningful features for baseline models |

Rigorous Benchmarking: Head-to-Head Performance Analysis

The emergence of large-scale foundation models trained on massive single-cell transcriptomics datasets has revolutionized computational biology, offering the potential to capture complex gene-gene relationships and cellular states. Among these, scGPT and scFoundation represent two leading approaches in the landscape of single-cell artificial intelligence. Within the specific context of drug response prediction—a critical task in oncology and therapeutic development—rigorous benchmarking is essential to guide model selection and application. This comparison guide focuses on evaluating these models under pooled-data evaluation scenarios, where models are trained and tested on aggregated data from multiple studies. This approach tests a model's ability to integrate diverse data sources and extract generalizable patterns, a capability with immense value for real-world drug discovery applications. Recent comprehensive studies, particularly the scDrugMap benchmark, have provided the community with robust, data-driven insights into the comparative performance of these models, consistently highlighting scFoundation's superior predictive capabilities in this specific evaluation setting [21] [11].

Quantitative Performance Comparison

The scDrugMap benchmark, an extensive framework for evaluating foundation models in drug response prediction, provides clear quantitative results from pooled-data evaluation. The table below summarizes the key performance metrics, measured by the F1 score, for the leading models.

Table 1: Model Performance on the Primary Data Collection across Evaluation Settings (scDrugMap Benchmark)

| Foundation Model | Training Strategy | Mean F1 Score | Performance Notes |
| --- | --- | --- | --- |
| scFoundation | Layer freezing | 0.971 | Highest-performing model; outperformed the lowest by 54% [21] |
| scFoundation | Fine-tuning (LoRA) | 0.947 | Highest-performing fine-tuned model [21] [11] |
| UCE | Fine-tuning (LoRA) | 0.774 | Top performer in cross-data evaluation after fine-tuning [21] |
| scGPT | Zero-shot learning | 0.858 | Superior performance in the zero-shot setting [21] |
| scBERT | Layer freezing | 0.630 | Lowest-performing model in this benchmark [21] |
The results demonstrate that scFoundation achieved the highest mean F1 scores of 0.971 (with layer freezing) and 0.947 (with fine-tuning using Low-Rank Adaptation) in the pooled-data evaluation on the primary collection of 326,751 single cells [21] [11]. This indicates that when data from multiple sources are aggregated, scFoundation's pretraining and architecture provide a significant advantage in accurately distinguishing between drug-sensitive and drug-resistant cells.
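The F1 score used throughout scDrugMap is the harmonic mean of precision and recall on the binary sensitive/resistant labels. A dependency-free sketch with toy labels:

```python
import numpy as np

def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Binary F1: harmonic mean of precision and recall (1 = sensitive)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

# Toy drug-response labels: 1 = drug-sensitive cell, 0 = resistant.
y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1])
print(round(f1_score(y_true, y_pred), 3))
```

Because F1 ignores true negatives, it is the natural choice when sensitive and resistant cells are imbalanced, as they typically are in tumor datasets.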

Experimental Protocols and Benchmarking Methodology

Understanding the experimental design behind these conclusions is crucial for interpreting the results.

The scDrugMap Framework and Data Curation

The scDrugMap benchmark provides a standardized environment for a fair comparison. Its key components include [21]:

  • Curated Datasets: A primary collection of 326,751 single tumor cells from 36 scRNA-seq datasets across 23 studies, covering 11 cancer types and three therapy categories (targeted therapy, chemotherapy, immunotherapy). A separate validation collection contained 18,856 cells.
  • Model Selection: The benchmark evaluated eight single-cell foundation models (including scFoundation and scGPT) and two general-purpose large language models.
  • Evaluation Scenarios: The critical distinction was between pooled-data evaluation (models trained and tested on aggregated data from multiple studies) and cross-data evaluation (models tested on held-out individual studies).
  • Training Strategies: To adapt foundation models, scDrugMap implemented both layer freezing (using the pretrained model as a fixed feature extractor) and fine-tuning with Low-Rank Adaptation (LoRA), which updates a small number of parameters to efficiently specialize the model.
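The two adaptation strategies differ mainly in which parameters receive gradients. The sketch below contrasts the trainable-parameter budget of full fine-tuning with a LoRA update on a single 512-by-512 projection; the dimensions and rank are illustrative, not scGPT's actual configuration:

```python
import numpy as np

d_model, rank = 512, 8

# Frozen pre-trained weight matrix (stand-in for one attention projection).
W = np.random.default_rng(2).normal(size=(d_model, d_model))

# LoRA: learn only a low-rank update W + A @ B while W stays frozen.
A = np.zeros((d_model, rank))  # zero-initialized so the update starts as identity
B = np.zeros((rank, d_model))

full_params = W.size           # what full fine-tuning would train
lora_params = A.size + B.size  # what LoRA trains instead
print(full_params, lora_params)

x = np.ones(d_model)
y = x @ (W + A @ B)            # forward pass with the adapted weight
```

With rank 8 the trainable budget drops from 262,144 to 8,192 parameters for this one matrix, which is why LoRA fits on modest hardware while leaving the pre-trained weights untouched.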

Key Methodological Differences Between Models

The performance differences can be traced back to architectural and pretraining choices:

  • scFoundation: This model employs a value projection-based strategy, directly predicting raw gene expression values. It is a masked autoencoder (MAE) with ~100 million parameters, pretrained on ~50 million human cells. This approach preserves the full resolution of the gene expression data [38] [12].
  • scGPT: This model uses a value categorization strategy, which bins continuous gene expression values into discrete categories and uses an autoregressive transformer architecture pretrained on over 33 million human cells [12].

The direct prediction of raw expression values by scFoundation may contribute to its advantage in capturing subtle, biologically relevant signals necessary for predicting complex phenotypes like drug response.
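The masked gene modeling objective behind this can be illustrated with a toy masked-MSE computation. This is purely schematic: the real scFoundation adds read-depth tokens and a transformer encoder, and the "prediction" here is a trivial stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 10
expression = rng.poisson(5.0, size=n_genes).astype(float)  # one cell's raw counts

# Randomly mask a subset of genes; the model must reconstruct their raw values.
mask = rng.random(n_genes) < 0.3
model_input = expression.copy()
model_input[mask] = 0.0  # masked positions hidden from the encoder

# A stand-in "prediction" (the real model outputs a continuous value per masked gene).
prediction = np.full(n_genes, expression[~mask].mean() if (~mask).any() else 0.0)

# MSE is computed only over the masked positions, as in masked gene modeling.
mse = np.mean((prediction[mask] - expression[mask]) ** 2) if mask.any() else 0.0
print(round(float(mse), 3))
```

Because the loss targets the raw values themselves, no expression resolution is lost to discretization, which is the contrast being drawn with scGPT's binning.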

Visualizing the Experimental Workflow

The following diagram illustrates the logical workflow of the scDrugMap benchmarking process that leads to scFoundation's top-tier performance in pooled-data evaluation.

[Diagram] Multi-study Single-cell Data → Foundation Models (scFoundation, scGPT, etc.) → Pooled-Data Evaluation → Training Strategy: Layer Freezing / LoRA → Drug Response Prediction → Performance Metrics (F1 Score) → scFoundation Superior Performance

Diagram 1: scDrugMap Benchmarking Workflow. This workflow outlines the key stages in the scDrugMap pooled-data evaluation, from data input and model processing to the final performance assessment that identified scFoundation's superior performance.

For researchers aiming to reproduce or build upon these benchmarks, the following table details essential computational resources and their functions.

Table 2: Essential Research Reagents and Computational Resources

| Resource / Solution | Function in Evaluation | Source / Reference |
| --- | --- | --- |
| scDrugMap Framework | Provides the integrated benchmarking environment, data loaders, and evaluation scripts. | https://scdrugmap.com/ [21] |
| Primary Data Collection | The curated set of 326,751 single cells from 36 datasets; serves as the primary benchmark. | Manually curated from 23 published studies [21] |
| Low-Rank Adaptation (LoRA) | A parameter-efficient fine-tuning strategy used to adapt foundation models to the drug response task. | Hu et al. (2021) [21] |
| Pre-trained Model Weights (scFoundation) | The foundational parameters of scFoundation, enabling transfer learning. | https://aigp.biomap.com/ [39] |
| Pre-trained Model Weights (scGPT) | The foundational parameters of scGPT for comparative analysis. | Cui et al. (2024) [12] |

The consistent findings from the scDrugMap benchmark firmly establish scFoundation as the leading model for drug response prediction in pooled-data evaluation scenarios. Its superior performance, evidenced by an F1 score exceeding 0.97, underscores the effectiveness of its value-projection-based architecture and large-scale pretraining when learning from aggregated, multi-study datasets. This capability is directly applicable to real-world drug discovery efforts that seek to integrate diverse experimental data to build robust predictive models.

However, model selection is context-dependent. While scFoundation excels in pooled-data evaluation, scGPT has demonstrated superior performance in zero-shot learning settings [21], and other models like UCE perform well in cross-data evaluations [21]. Therefore, the choice between scFoundation and alternatives should be guided by the specific experimental design and application requirements. Researchers are encouraged to leverage the scDrugMap platform and the resources outlined in this guide to conduct their own validations, further solidifying the evidence-based application of single-cell foundation models in accelerating therapeutic development.

The emergence of single-cell foundation models (scFMs) represents a transformative development in computational biology, promising to unlock deeper insights into cellular behavior and accelerate therapeutic discovery. These models, pre-trained on millions of single-cell transcriptomes, aim to capture universal biological principles that can be adapted to diverse downstream tasks. Among the most prominent scFMs are scGPT and scFoundation, both transformer-based architectures trained at unprecedented scale. However, rigorous independent benchmarking has revealed critical insights about their respective capabilities and limitations, particularly regarding cross-data generalization and zero-shot performance—the ability to perform tasks without task-specific training. This comparative analysis synthesizes evidence from multiple recent studies to evaluate scGPT's performance against scFoundation and other alternatives, focusing specifically on generalization capabilities that are essential for real-world biomedical applications where labeled data is scarce or novel cell types and perturbations are encountered.

Performance Comparison: Quantitative Benchmarking Across Tasks

Perturbation Response Prediction Capabilities

Predicting cellular responses to genetic perturbations constitutes a fundamental test for scFMs' understanding of gene regulatory networks. Recent benchmarks have evaluated scGPT and scFoundation against simpler baseline models on standardized Perturb-seq datasets, with revealing results.

Table 1: Performance Comparison on Perturbation Response Prediction (Pearson Correlation in Differential Expression Space)

| Model | Adamson Dataset | Norman Dataset | Replogle K562 | Replogle RPE1 |
| --- | --- | --- | --- | --- |
| Train Mean (Simplest Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest with GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| Random Forest with scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |

[5] [14]

Surprisingly, even the simplest baseline—predicting the mean of training examples—outperformed both foundation models across all datasets [5]. Similarly, a Nature Methods study found that "deep-learning-based foundation models did not perform better than deliberately simplistic linear prediction models" for predicting gene perturbation effects [14]. However, when scGPT's pretrained gene embeddings were used in simpler random forest models, performance improved substantially, suggesting these embeddings do contain biologically meaningful information that the full fine-tuned model fails to leverage optimally [5].
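The "train mean" baseline that beats both foundation models is trivially simple: it predicts the same vector, the average training profile, for every unseen perturbation. A sketch with synthetic data (illustrative arrays, not the benchmark's):

```python
import numpy as np

rng = np.random.default_rng(2)
n_train_perts, n_test_perts, n_genes = 50, 10, 200

# Pseudo-bulked post-perturbation expression profiles, one row per perturbation.
train_profiles = rng.normal(size=(n_train_perts, n_genes))
test_profiles = rng.normal(size=(n_test_perts, n_genes))

# The baseline's "prediction" is the mean training profile, repeated for every
# held-out perturbation -- it ignores the identity of the perturbed gene entirely.
train_mean = train_profiles.mean(axis=0)
predictions = np.tile(train_mean, (n_test_perts, 1))

assert predictions.shape == test_profiles.shape
```

That a perturbation-blind predictor is competitive suggests much of the measurable signal in these datasets is shared across perturbations rather than perturbation-specific.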

Zero-Shot Generalization Capabilities

Zero-shot evaluation tests models' ability to perform tasks without any task-specific training, which is critical for exploratory research where labels are unknown. Recent benchmarking reveals important patterns in how scGPT and scFoundation perform in these settings.

Table 2: Zero-Shot Performance Across Critical Tasks (Relative Performance Ranking)

| Task | scGPT | scFoundation | Geneformer | Simple Baselines (HVG, etc.) |
| --- | --- | --- | --- | --- |
| Cell Type Clustering | Intermediate | Limited data | Lowest performance | Highest performance |
| Batch Integration | Variable | Limited data | Lowest performance | Highest performance |
| Unseen Drug Prediction | Strong (F1: 0.858) | Not top performer | Not evaluated | Intermediate |
| Unseen Cell Line Prediction | State-of-the-art | Not top performer | Not evaluated | Lower performance |

[16] [22] [11]

In zero-shot cell type clustering and batch integration, both scGPT and Geneformer were consistently outperformed by simpler methods like Highly Variable Genes (HVG) selection and established integration tools like Harmony and scVI [16] [15]. However, scGPT demonstrated remarkable zero-shot capability in specific generalization tasks, achieving superior performance (F1 score: 0.858) in predicting responses to unseen drugs according to scDrugMap benchmarking [22] [11]. Additionally, scGPT-based approaches enabled "zero-shot generalization to unseen cell lines," representing a significant advancement for drug discovery applications [29].

Experimental Protocols and Methodologies

Standardized Benchmarking Frameworks

Recent independent evaluations have established rigorous methodologies for assessing scFMs. The benchmarking protocol for perturbation prediction typically involves fine-tuning pre-trained models on specific datasets followed by held-out evaluation [5] [14]. For genetic perturbation prediction, models are trained on single-gene perturbations and evaluated on their ability to predict effects of double-gene perturbations or unseen single-gene perturbations [5] [14]. The key innovation in recent benchmarks is the inclusion of deliberately simple baselines like "mean prediction" and linear models, which provide reality checks on claimed capabilities [5] [14].

Zero-shot evaluation protocols differ significantly, as they exclude task-specific fine-tuning entirely [16] [15]. In these frameworks, models generate cell embeddings that are directly evaluated on tasks like cell type clustering and batch correction using metrics such as Average BIO score (AvgBio) for clustering accuracy and Principal Component Regression (PCR) for batch effect removal [16] [15]. The scDrugMap framework introduces both pooled-data evaluation (standard fine-tuning) and cross-data evaluation (assessing generalization to novel contexts) [22] [11].

Critical Evaluation Metrics and Datasets

Benchmarking studies employ specialized metrics tailored to each task. For perturbation prediction, the Pearson Delta metric—correlation between predicted and actual differential expression profiles—has emerged as particularly informative because it focuses on expression changes rather than absolute values, which are dominated by highly expressed genes [5]. Additional metrics include L2 distance for top differentially expressed genes and genetic interaction detection capability [14].
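The Pearson Delta metric described above correlates predicted and observed *changes* relative to control, rather than absolute expression. A minimal sketch (array names and data are illustrative):

```python
import numpy as np

def pearson_delta(pred_expr, true_expr, control_expr):
    """Pearson correlation of predicted vs. observed differential expression."""
    pred_delta = pred_expr - control_expr
    true_delta = true_expr - control_expr
    return float(np.corrcoef(pred_delta, true_delta)[0, 1])

rng = np.random.default_rng(3)
control = rng.normal(size=100)            # control expression for 100 genes
true_change = rng.normal(size=100)        # the perturbation's true effect
truth = control + true_change
good_pred = truth + rng.normal(scale=0.1, size=100)  # close to the real effect

print(pearson_delta(good_pred, truth, control))  # near 1 for a good prediction
```

Subtracting the control profile means a model cannot score well simply by reproducing the (largely constant) baseline expression of highly expressed genes, which is exactly why the metric is preferred here.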

For zero-shot evaluation, Average BIO score measures cell type clustering quality, while batch integration metrics quantify a model's ability to remove technical artifacts while preserving biological variation [16] [15]. In drug response prediction, F1 scores evaluate classification accuracy, particularly important for imbalanced datasets common in pharmaceutical applications [22] [11].

The most commonly used datasets in recent benchmarks include:

  • Norman et al. dataset: CRISPRa perturbations in K562 cells [5] [14]
  • Adamson et al. dataset: CRISPRi perturbations in K562 cells [5] [14]
  • Replogle et al. dataset: Genome-wide CRISPRi in K562 and RPE1 cells [5] [14]
  • Pancreas benchmark: Multiple technologies for batch integration [16] [15]
  • scDrugMap collection: 326,751 cells across 36 datasets for drug response [22] [11]

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Experimental Resources for Single-Cell Foundation Model Evaluation

| Resource Category | Specific Examples | Function in Evaluation |
| --- | --- | --- |
| Perturbation Datasets | Norman et al. (2019), Adamson et al. (2016), Replogle et al. (2022) | Provide ground truth for evaluating perturbation prediction capabilities |
| Benchmarking Platforms | scDrugMap, GEARS, Custom benchmarking pipelines | Standardized evaluation frameworks for fair model comparison |
| Evaluation Metrics | Pearson Delta, AvgBio score, F1 score, PCR score | Quantify model performance across different task types |
| Baseline Models | Train Mean, Random Forest with GO features, HVG selection | Provide critical reference points for interpreting model performance |
| Integration Tools | Harmony, scVI, Seurat | Established methods for comparison on integration tasks |

[5] [16] [22]

Analysis of Generalization Strengths and Limitations

scGPT's Emerging Zero-Shot Capabilities

Despite overall mixed performance across benchmarks, scGPT demonstrates notable strengths in specific generalization scenarios. The model excels particularly in cross-data evaluation and zero-shot drug response prediction, suggesting its pre-training on 33 million human cells has conferred meaningful biological understanding that transfers to novel contexts [22] [11] [29]. This capability is particularly valuable for drug discovery, where researchers need to predict compound effects on cell types or disease states not included in training data.

The architecture of scGPT, which uses a perturbation token added to the perturbed gene token to model perturbation effects, appears to provide a flexible framework for generalizing to novel conditions [5] [29]. Additionally, scGPT's strong performance when its embeddings are used in simpler models indicates that the pre-training process successfully captures biologically meaningful relationships, even if the full fine-tuning pipeline doesn't always leverage this knowledge optimally [5].

Comparative Limitations of scFoundation

While scFoundation demonstrates strong performance in certain specialized tasks—particularly pooled-data evaluation where it achieved F1 scores of 0.971 in drug response prediction—it shows more limited generalization capability in cross-data and zero-shot settings [22] [11]. The model also faces practical limitations regarding gene set compatibility, as it "required each dataset to exactly match the genes from its own pretraining data," creating challenges for application to novel datasets [14].

scFoundation's architecture uses pretrained gene embeddings as inputs for graph neural-network based models like GEARS for perturbation prediction [5]. While this approach shows promise, current benchmarks indicate it hasn't yet surpassed simpler alternatives in generalization tasks. However, it's worth noting that scFoundation excels in read-depth enhancement and specific drug response prediction scenarios where data matches its pre-training specifications [38].

[Diagram: Generalization Capability Spectrum]
  • scGPT strengths: Zero-Shot Drug Prediction (F1: 0.858) → Unseen Cell Line Generalization → Cross-Data Evaluation; Embedding Transferability
  • scFoundation strengths: Pooled-Data Performance (F1: 0.971) → Read-Depth Enhancement → Structured Data Prediction
  • Shared limitations: Underperformance vs. Simple Baselines; Limited Zero-Shot Clustering; High Computational Demands

Current benchmarking evidence presents a nuanced picture of single-cell foundation model capabilities. While both scGPT and scFoundation show promising performance in specific domains, neither consistently outperforms simpler baseline methods across diverse tasks [5] [16] [14]. However, scGPT demonstrates distinctive strengths in cross-data and zero-shot generalization, particularly for drug response prediction in unseen cell types and conditions [22] [11] [29].

These findings have important implications for researchers and drug development professionals. Model selection should be guided by specific application requirements: scFoundation may be preferable for tasks involving well-characterized cellular systems where data matches its pre-training specifications, while scGPT appears better suited for exploratory research requiring generalization to novel biological contexts [22] [11] [4]. The consistent underperformance of both models compared to simple baselines in certain tasks highlights the importance of rigorous benchmarking and the need for continued methodological development [5] [14].

Future research directions should focus on improving model efficiency, enhancing zero-shot capabilities, and developing more biologically meaningful pretraining objectives. As noted in recent benchmarks, "the goal of providing a generalizable representation of cellular states and predicting the outcome of not-yet-performed experiments is still elusive" [14], indicating substantial room for advancement in the field of single-cell foundation models.

Within the rapidly evolving field of single-cell biology, foundation models like scGPT and scFoundation promise a transformative understanding of cellular behavior by leveraging vast amounts of transcriptomics data. These models are designed to capture universal patterns in gene regulation, which can then be fine-tuned for specific downstream tasks, such as predicting gene expression changes following genetic perturbations. Concurrently, traditional machine learning methods like Random Forest (RF), regularized linear models such as Elastic-Net, and straightforward analytical techniques like selecting Highly Variable Genes (HVG) have long served as reliable benchmarks for performance. This guide provides an objective, data-driven comparison of these foundational and traditional approaches, synthesizing findings from recent rigorous benchmarking studies to inform researchers and drug development professionals about their relative strengths and limitations in practical applications.

Results and Comparative Performance

Recent independent benchmark studies consistently reveal a significant finding: traditional methods, including simple baseline models, often meet or exceed the performance of sophisticated foundation models in critical tasks like perturbation prediction and cell type identification.

Performance in Post-Perturbation Gene Expression Prediction

A benchmark study evaluating scGPT and scFoundation against baseline models on four Perturb-seq datasets provides quantitative evidence of their relative performance, measured by the Pearson correlation of predicted versus actual differential gene expression (Pearson Delta) [5].

Table 1: Benchmarking Performance on Post-Perturbation Prediction (Pearson Delta)

| Model / Dataset | Adamson | Norman | Replogle (K562) | Replogle (RPE1) |
| --- | --- | --- | --- | --- |
| Train Mean (Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest (GO features) | 0.739 | 0.586 | 0.480 | 0.648 |
| Random Forest (scGPT embeddings) | 0.727 | 0.583 | 0.421 | 0.635 |

The data shows that the simple baseline of predicting the training set mean outperformed both foundation models across all datasets. Furthermore, a Random Forest (RF) model using Gene Ontology (GO) features as input "outperformed foundation models by a large margin" [5]. This superior performance was also consistent in a sub-analysis of combinatorial perturbations in the Norman dataset [5].

A separate study in Nature Methods confirmed these findings, noting that for predicting double perturbation effects, "all models had a prediction error substantially higher than the additive baseline," a simple model that sums individual logarithmic fold changes [14]. The study also developed a simple linear model that consistently matched or outperformed foundation models in predicting unseen single-gene perturbations [14].
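The additive baseline referenced above is just the sum of the two single-perturbation log fold changes, gene by gene. A sketch with made-up values:

```python
import numpy as np

# Log fold changes (vs. control) measured for two single-gene perturbations,
# over the same four genes (toy numbers).
lfc_gene_a = np.array([1.2, -0.5, 0.0, 0.3])
lfc_gene_b = np.array([-0.4, 0.8, 0.1, 0.0])

# Additive baseline: the predicted double-perturbation effect is simply
# the element-wise sum of the single-perturbation effects.
lfc_double_pred = lfc_gene_a + lfc_gene_b
print(lfc_double_pred)  # [ 0.8  0.3  0.1  0.3]
```

Any genuine genetic interaction (synergy or epistasis) appears as a deviation from this sum, so a learned model must beat this baseline to demonstrate it has captured interactions rather than additivity.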

Performance in Zero-Shot Cell Type Identification and Batch Integration

In tasks where models are used without fine-tuning (zero-shot), foundation models have shown limitations compared to established methods.

Table 2: Zero-Shot Performance on Cell Type Clustering (Average BIO Score)

| Method | PBMC (12k) | Tabula Sapiens | Pancreas | Immune |
| --- | --- | --- | --- | --- |
| HVG | ~0.55 | ~0.72 | ~0.65 | ~0.70 |
| scVI | ~0.63 | ~0.70 | ~0.62 | ~0.64 |
| Harmony | ~0.57 | ~0.68 | ~0.60 | ~0.69 |
| scGPT | ~0.60 | ~0.65 | ~0.55 | ~0.63 |
| Geneformer | ~0.35 | ~0.40 | ~0.40 | ~0.35 |

In zero-shot cell type clustering, both HVG selection and models like scVI and Harmony consistently outperformed scGPT and Geneformer across multiple datasets, as measured by Average BIO score [16]. Geneformer's performance was particularly low. For batch integration, a task critical for combining datasets from different sources, "the best batch integration scores for all datasets were achieved by selecting HVG," with foundation models again underperforming relative to established baselines [16].
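The HVG baseline that tops these comparisons is itself only a variance filter followed by standard clustering. A self-contained sketch using scikit-learn on synthetic data (real pipelines normalize and log-transform counts first; all numbers here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(4)
n_cells, n_genes, n_types = 300, 500, 3

# Synthetic data: three "cell types" that differ only in the first 50 genes.
labels = rng.integers(0, n_types, size=n_cells)
X = rng.normal(size=(n_cells, n_genes))
X[:, :50] += labels[:, None] * 3.0

# HVG selection: keep the 100 genes with highest variance across cells.
hvg_idx = np.argsort(X.var(axis=0))[-100:]
X_hvg = X[:, hvg_idx]

# Cluster in the HVG subspace and score against the true labels.
pred = KMeans(n_clusters=n_types, n_init=10, random_state=0).fit_predict(X_hvg)
ari = adjusted_rand_score(labels, pred)
print(ari)  # high on this easy synthetic example
```

The point of the baseline is exactly this simplicity: if a billion-parameter embedding cannot beat a variance filter plus k-means, its pretraining has not added usable structure for the task.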

Experimental Protocols

To ensure the reproducibility and transparency of the comparisons cited, this section details the key experimental methodologies employed in the benchmark studies.

Benchmarking Workflow for Perturbation Prediction

The evaluation of perturbation prediction models follows a structured process to ensure a fair and comparable assessment.

[Diagram] Start: Benchmarking Perturbation Prediction → Data Collection & Preprocessing → Model Selection & Configuration → Task Definition: Perturbation Exclusive (PEX) → Model Fine-tuning (Foundation Models) / Model Training (Baseline Models) → Prediction & Evaluation

Title: Perturbation Prediction Benchmarking Workflow

1. Data Collection and Preprocessing:

  • Datasets: Benchmarks typically use publicly available Perturb-seq datasets (e.g., Adamson, Norman, Replogle) which employ CRISPR-based perturbations (CRISPRi/CRISPRa) and single-cell RNA sequencing to profile resulting gene expression [5] [14].
  • Processing: Single-cell expression profiles are aggregated (pseudo-bulked) by perturbation condition to create an average expression profile for each perturbation [5].
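Pseudo-bulking as described above is a per-condition average of single-cell profiles. A sketch with pandas (toy data; column names are illustrative):

```python
import pandas as pd

# Toy single-cell expression table: rows are cells, columns are genes,
# plus a column naming the perturbation each cell received.
cells = pd.DataFrame({
    "perturbation": ["ctrl", "ctrl", "KLF1", "KLF1", "KLF1"],
    "geneA": [1.0, 3.0, 5.0, 7.0, 6.0],
    "geneB": [0.0, 2.0, 1.0, 1.0, 4.0],
})

# Pseudo-bulk: the average expression profile per perturbation condition.
pseudobulk = cells.groupby("perturbation").mean()
print(pseudobulk)
```

Averaging across cells suppresses per-cell technical noise, so the downstream models are compared on one clean profile per perturbation rather than thousands of noisy cells.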

2. Model Selection and Configuration:

  • Foundation Models: scGPT and scFoundation are used in their pre-trained form, then fine-tuned on the training split of the benchmark data according to the authors' specifications [5].
  • Baseline Models: These include:
    • Train Mean: The average pseudo-bulk expression profile of all training set perturbations. All predictions are identical to this mean vector [5] [14].
    • Random Forest (RF): Trained using prior biological knowledge as features, such as Gene Ontology (GO) vectors or gene embeddings from foundation models (e.g., scGPT, scFoundation) or language models (scELMO) [5].
    • Additive Model: For double perturbations, predicts the sum of the log fold changes observed in the corresponding single perturbations [14].
    • Linear Model: A simple linear model that maps perturbation and gene embeddings to expression outcomes [14].

3. Task Definition (Perturbation Exclusive - PEX):

  • The core task is to evaluate a model's ability to generalize to unseen perturbations. Models are trained on a set of perturbations and then tested on a held-out set of perturbations not seen during training [5].
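A PEX split partitions *perturbations*, not cells, so no test-set perturbation is ever seen during training. A sketch (gene names are placeholders):

```python
import random

perturbations = [f"gene_{i}" for i in range(20)]

random.seed(0)
random.shuffle(perturbations)
split = int(0.8 * len(perturbations))
train_perts, test_perts = perturbations[:split], perturbations[split:]

# Perturbation-exclusive guarantee: no overlap between train and test conditions.
assert set(train_perts).isdisjoint(test_perts)
print(len(train_perts), len(test_perts))  # 16 4
```

Splitting by cell instead of by perturbation would let the model memorize each perturbation's average effect, which is precisely what the train-mean baseline exposes.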

4. Evaluation Metrics:

  • Primary Metric: Pearson correlation in the differential expression space (Pearson Delta). This involves comparing the predicted change in expression (perturbed vs. control) against the ground truth change, focusing the evaluation on the model's ability to capture the perturbation's effect rather than baseline expression [5].
  • Secondary Metrics: Evaluation of performance on the top 20 differentially expressed genes, and L2 distance between predicted and observed expression values [5] [14].

Workflow for Zero-Shot Evaluation

Evaluating models in a zero-shot setting tests their inherent biological understanding without task-specific fine-tuning.

[Diagram] Start: Zero-Shot Evaluation → Load Pre-trained Model (No Fine-tuning) → Input Target Dataset (e.g., Pancreas, Immune) → Generate Cell Embeddings → Perform Clustering on Embeddings → Evaluate Against Ground Truth Labels

Title: Zero-Shot Evaluation Workflow

1. Model Preparation:

  • Pre-trained models (scGPT, Geneformer) are loaded without any further fine-tuning on the target evaluation dataset [16].

2. Embedding Generation:

  • The target dataset (e.g., a pancreas dataset with batch effects) is passed through the pre-trained model to generate a low-dimensional vector representation (embedding) for each cell [16].

3. Downstream Task Execution:

  • Cell Type Clustering: A standard clustering algorithm is applied to the cell embeddings. The resulting clusters are compared to known cell type labels [16].
  • Batch Integration: The embeddings are visualized and analyzed to see if cells from different experimental batches are mixed together (indicating successful integration) while biological separation (e.g., by cell type) is maintained [16].

4. Evaluation Metrics:

  • Clustering: Metrics like Average BIO score, Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI) quantify how well the clusters match the true labels [16] [40].
  • Batch Integration: Metrics such as batch mixing score and Principal Component Regression (PCR) quantify the removal of technical batch effects while preserving biological variance [16].
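ARI and NMI, mentioned in step 4, are both off-the-shelf scikit-learn calls comparing a clustering to ground-truth labels. A minimal example (toy labels):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_types = ["T", "T", "B", "B", "NK", "NK"]
clusters = [0, 0, 1, 1, 2, 0]  # one NK cell mis-clustered with the T cells

# Both metrics are invariant to label names -- only the grouping matters.
ari = adjusted_rand_score(true_types, clusters)
nmi = normalized_mutual_info_score(true_types, clusters)
print(ari, nmi)
```

A perfect clustering scores 1.0 on both; the single misassigned cell here pulls both scores well below 1, illustrating their sensitivity.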

This section details essential computational tools and data resources central to conducting benchmarking studies in single-cell genomics.

Table 3: Essential Resources for Single-Cell Benchmarking Studies

| Category | Item / Software | Function in Research | Key Considerations |
| --- | --- | --- | --- |
| Benchmark Datasets | Perturb-seq (Adamson, Norman, Replogle) | Provides causal perturbation→expression data for training/evaluating predictive models. | Check for low perturbation-specific variance, which can complicate evaluation [5]. |
| | CITE-seq Datasets (e.g., from SPDB) | Provides paired transcriptomic and proteomic data from the same cells for cross-modal method testing [40]. | Enables benchmarking on consistent biological conditions across omics. |
| Foundation Models | scGPT | A transformer-based foundation model for single-cell data; used for prediction and generating gene/cell embeddings [5] [16]. | Requires fine-tuning for specific tasks; zero-shot performance may be limited [16]. |
| | Geneformer | Another transformer-based foundation model pre-trained on single-cell data [16]. | Like scGPT, its zero-shot performance can be inconsistent [16]. |
| Traditional ML Models | Random Forest (scikit-learn) | An ensemble tree-based model used for regression and classification. Often serves as a strong, interpretable baseline. | Can leverage biological features (GO terms) and often outperforms more complex models [5]. |
| | Elastic-Net (GLMNET) | A linear model combining L1 and L2 regularization. Effective for feature selection and dealing with correlated variables [41]. | Useful for biomarker identification and building parsimonious models. |
| Analysis & Evaluation | HVG Selection | A standard preprocessing step to select genes with high cell-to-cell variation for downstream analysis like clustering. | A simple yet highly effective baseline for tasks like clustering and batch integration [16]. |
| | scVI / Harmony | Tools for single-cell data analysis, specializing in probabilistic modeling (scVI) and batch correction (Harmony) [16]. | Often outperform foundation models in tasks like batch integration and cell type clustering [16]. |

scGPT for Multi-omics Integration vs. scFoundation for Drug Response

Single-cell foundation models (scFMs) represent a transformative advance in computational biology, leveraging large-scale deep learning to interpret complex single-cell data. Trained on millions of single-cell transcriptomes through self-supervised learning, these models develop foundational knowledge of cellular biology that can be adapted to various downstream tasks [1]. Among the leading scFMs, scGPT and scFoundation have emerged as prominent yet specialized models, each demonstrating distinct strengths across different application domains. This guide provides a comprehensive, evidence-based comparison of their capabilities, with particular focus on scGPT's proficiency in multi-omics integration versus scFoundation's performance in drug response prediction, drawing upon recent benchmarking studies to inform researchers and drug development professionals.

Model Architectures and Pretraining Foundations

Core Architectural Differences

The specialized capabilities of scGPT and scFoundation stem from their distinct architectural designs and pretraining methodologies, which shape their effectiveness for different biological tasks.

scGPT utilizes a transformer architecture inspired by the Generative Pretrained Transformer (GPT) family, employing a unidirectional masked self-attention mechanism that iteratively predicts masked genes conditioned on known genes [1]. This design excels at generative tasks and multi-modal integration. The model comprises approximately 50 million parameters and was pretrained on around 33 million non-cancerous human cells [4]. A key strength of scGPT lies in its flexible input representation, which can incorporate diverse omics modalities—including scRNA-seq, scATAC-seq, CITE-seq, and spatial transcriptomics—through specialized tokenization strategies that bin expression values and use modality-specific tokens [1] [4].

scFoundation employs an asymmetric encoder-decoder architecture with approximately 100 million parameters, pretrained on roughly 50 million cells [4]. Unlike scGPT, it processes a comprehensive set of 19,264 human protein-encoding genes alongside common mitochondrial genes [4]. Its pretraining utilizes a read-depth-aware masked gene modeling objective with mean squared error loss, focusing on reconstructing gene expression values [4]. This design prioritizes capturing deep relationships within transcriptomics data rather than cross-modal integration.

Tokenization and Input Representation Strategies
| Model | Input Gene Strategy | Value Embedding | Positional Embedding | Multi-omics Support |
| --- | --- | --- | --- | --- |
| scGPT | 1,200 highly variable genes (HVGs) | Value binning | Not used | Native support for multiple modalities |
| scFoundation | All 19,264 protein-encoding genes | Value projection | Not used | Primarily scRNA-seq focused |

Table 1: Input representation strategies for scGPT and scFoundation

The models differ significantly in their tokenization approaches. scGPT uses highly variable genes and employs value binning to transform continuous expression values into discrete tokens, facilitating its transformer-based processing [4]. In contrast, scFoundation utilizes the complete set of protein-encoding genes with value projection, preserving more comprehensive genomic information but requiring more computational resources [4]. These fundamental differences in architecture and input representation establish the foundation for their divergent performance across specialized tasks.
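The two value-encoding strategies can be contrasted in a few lines: binning maps a continuous expression value to a discrete token index, while projection maps it through a learned linear layer to a continuous embedding. This is a schematic sketch; bin boundaries, dimensions, and weights are all illustrative, not the models' actual values.

```python
import numpy as np

rng = np.random.default_rng(5)
expr = np.array([0.0, 0.4, 1.7, 3.2, 8.9])  # normalized expression for 5 genes

# scGPT-style value binning: discretize into bins -> discrete token IDs.
bin_edges = np.array([0.5, 2.0, 5.0])  # illustrative bin boundaries
tokens = np.digitize(expr, bin_edges)  # integer category per gene
print(tokens)  # [0 0 1 2 3]

# scFoundation-style value projection: a learned linear map from the scalar
# value to a continuous embedding vector (trainable in the real model).
d_model = 8
W_proj = rng.standard_normal((d_model, 1)) * 0.1
value_embeddings = (W_proj @ expr[None, :]).T  # one continuous vector per gene
print(value_embeddings.shape)  # (5, 8)
```

Note that binning collapses 3.2 and 8.9-style differences only if they fall in the same bin, whereas projection preserves the exact magnitude; that is the resolution trade-off discussed above.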

Performance Benchmarking: Quantitative Comparisons

Multi-omics Integration Capabilities

scGPT demonstrates superior capabilities in integrating diverse molecular modalities, a critical requirement for comprehensive cellular analysis. Benchmarking studies consistently highlight scGPT's architectural advantages for multi-omics tasks. According to comprehensive evaluations, scGPT's design natively supports "scRNA-seq, scATAC-seq, CITE-seq, and spatial transcriptomics" through its flexible tokenization system [1] [4]. This enables researchers to jointly analyze gene expression, chromatin accessibility, and protein abundance within a unified representation space.

The BioLLM benchmarking framework, which provides standardized evaluation of multiple scFMs, identified scGPT as exhibiting "robust performance across all tasks," with particular strength in multi-omics integration scenarios [17]. This cross-modal capability stems from scGPT's use of modality-specific tokens and its value binning approach, which creates a standardized representation scheme across different data types [1]. When processing multi-omics data, scGPT can effectively leverage relationships between different molecular layers, enabling more holistic cellular state characterization.

Drug Response Prediction Performance

scFoundation shows specialized strength in drug response prediction, particularly in contexts with sufficient training data. The scDrugMap benchmarking study, which evaluated eight single-cell foundation models across 326,751 cells from 36 datasets, found that scFoundation "outperformed all others" in pooled-data evaluation for drug response prediction [21]. Specifically, scFoundation achieved the highest mean F1 scores of 0.971 and 0.947 using layer-freezing and fine-tuning strategies respectively, outperforming the lowest-performing model by 54% and 57% [21].

However, model performance varies significantly based on evaluation scenarios. In cross-data evaluation, where models are tested on completely independent datasets, UCE achieved the highest performance (mean F1 score: 0.774) after fine-tuning on tumor tissue, while scGPT demonstrated superior performance (mean F1 score: 0.858) in zero-shot learning settings [21]. This indicates that while scFoundation excels with adequate training data, scGPT may offer better generalization to novel drug compounds or cellular contexts without task-specific fine-tuning.

Comparative Performance Across Tasks

| Task | Best Performing Model | Key Metric | Performance Context |
| --- | --- | --- | --- |
| Multi-omics integration | scGPT | Qualitative assessment | Native architectural support |
| Drug response (pooled-data) | scFoundation | F1 score | 0.971 (layer-freezing) |
| Drug response (zero-shot) | scGPT | F1 score | 0.858 (cross-data) |
| Post-perturbation prediction | Simple baselines | Pearson Delta | Outperform both models |
| Batch integration | scGPT (on complex batches) | Batch mixing scores | Outperforms scFoundation |

Table 2: Task-specific performance comparison between scGPT and scFoundation

Recent independent benchmarking reveals important limitations for both models in certain applications. A critical evaluation published in Nature Methods found that for predicting transcriptome changes after genetic perturbations, "none outperformed the baselines," including deliberately simple additive and no-change models [14]. Similarly, a study in BMC Genomics reported that even the simplest baseline model—taking the mean of training examples—outperformed both scGPT and scFoundation for post-perturbation gene expression prediction [5].

Experimental Protocols and Methodologies

Multi-omics Integration Workflow

Multi-omics integration workflow (schematic): scRNA-seq, scATAC-seq, CITE-seq, and spatial transcriptomics data each undergo modality tokenization, followed by value binning and gene/feature embedding; the embedded tokens are assembled into a unified multi-omic sequence, processed by the scGPT transformer, and the resulting integrated embeddings feed downstream analysis.

Figure 1: scGPT Multi-omics Integration Workflow

The multi-omics integration protocol using scGPT follows a standardized workflow (Figure 1). First, data from different modalities (scRNA-seq, scATAC-seq, spatial transcriptomics, CITE-seq) undergo modality-specific tokenization, where each modality is assigned special token identifiers [1] [4]. Expression or accessibility values are then processed through value binning, which discretizes continuous measurements into predefined ranges [4]. The tokenized sequences are concatenated into a unified input sequence with positional information, though scGPT does not use traditional positional embeddings [4].

The model processes this integrated sequence through its transformer architecture, employing masked self-attention to capture cross-modal relationships. During fine-tuning for specific integration tasks, the cell embedding (a specialized [CLS] token) is typically extracted as the integrated representation [1]. For evaluation, researchers commonly assess clustering purity, batch integration metrics, and the preservation of biological variance using standardized benchmarks like the AIDA v2 dataset [4].

Drug Response Prediction Methodology

Drug response prediction workflow (schematic): single-cell expression matrices are passed through scFoundation for feature extraction; together with drug compound information and sensitive/resistant response labels, the extracted features are partitioned into pooled-data and cross-data evaluation splits, models are adapted via layer freezing or LoRA fine-tuning, and performance is reported as F1 score, AUROC, and accuracy.

Figure 2: Drug Response Prediction Experimental Workflow

The drug response prediction protocol follows rigorous benchmarking standards established by scDrugMap [21]. As shown in Figure 2, the process begins with curating single-cell expression matrices from drug-treated samples, with balanced representation of sensitive and resistant cells. The scFoundation model serves as a feature extractor, generating latent representations of each cell's transcriptional state [21].

Two evaluation scenarios are implemented: pooled-data evaluation (models trained and tested on aggregated data from multiple studies) and cross-data evaluation (models tested on completely independent datasets) [21]. For model adaptation, researchers employ either layer freezing (using scFoundation as a fixed feature extractor) or fine-tuning with Low-Rank Adaptation (LoRA), which updates a small subset of parameters [21]. Performance is assessed using F1 scores, AUROC, and accuracy, with particular emphasis on generalizability across different tissue types, cancer types, and treatment regimens [21].
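The F1 scores reported in these evaluations reduce to precision and recall over predicted sensitive/resistant labels. A self-contained sketch of the binary case follows; the label convention (1 = resistant) is illustrative, not scDrugMap's.

```python
def binary_f1(y_true, y_pred):
    """F1 score for binary labels (here 1 = resistant, 0 = sensitive;
    the label convention is illustrative, not scDrugMap's)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Because F1 ignores true negatives, it is less flattered by class imbalance than raw accuracy, which is one reason drug response benchmarks like scDrugMap report it as the headline metric.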

| Resource Type | Specific Examples | Function in Analysis |
| --- | --- | --- |
| Data resources | CZ CELLxGENE [1], Human Cell Atlas [1], GEO/SRA [1] | Provide standardized single-cell datasets for pretraining and benchmarking |
| Benchmarking platforms | scDrugMap [21], BioLLM [17] | Offer standardized evaluation frameworks and metrics |
| Computational tools | Low-Rank Adaptation (LoRA) [21], layer freezing [21] | Enable efficient model fine-tuning with limited data |
| Evaluation metrics | F1 score [21], Pearson Delta [5], batch integration scores [16] | Quantify model performance across different tasks |
| Perturbation datasets | Perturb-seq [5] [14], Norman et al. [14], Replogle et al. [14] | Provide ground truth for evaluating perturbation prediction |

Table 3: Essential Research Resources for scFM Evaluation

Successful application of scGPT and scFoundation requires access to several key resources. Public data repositories like CZ CELLxGENE (containing over 100 million unique cells) and the Human Cell Atlas provide essential pretraining corpora and standardized datasets [1]. For drug response prediction, the scDrugMap resource offers curated collections of 326,751 primary cells and 18,856 validation cells with drug response annotations [21].

Computationally, Low-Rank Adaptation (LoRA) has emerged as a critical technique for efficient fine-tuning of both models, significantly reducing computational requirements while maintaining performance [21]. For rigorous evaluation, established perturbation datasets (Adamson, Norman, Replogle) serve as standard benchmarks, though recent studies caution about their limitations in capturing perturbation-specific variance [5] [14].
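LoRA's efficiency comes from freezing the pretrained weight W and learning only a low-rank update ΔW = BA with rank r much smaller than the layer dimensions. A minimal NumPy sketch of the arithmetic, with illustrative dimensions (not those of scGPT or scFoundation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 512, 768, 8, 16   # illustrative sizes

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, init 0

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; with B initialized
    # to zero, the adapted layer starts out identical to the frozen one.
    return x @ (W + (alpha / r) * (B @ A)).T

full_params = W.size          # what full fine-tuning would update
lora_params = A.size + B.size # what LoRA updates instead
```

Here `lora_params / full_params` = r(d_in + d_out) / (d_in · d_out) ≈ 2.6%, which is why LoRA makes fine-tuning feasible on modest hardware while leaving the pretrained weights untouched.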

Based on comprehensive benchmarking evidence, scGPT represents the superior choice for researchers requiring flexible multi-omics integration and generalization to novel biological contexts without extensive fine-tuning. Its architectural advantages in handling diverse data modalities and strong zero-shot performance make it particularly valuable for exploratory research where labeled data is scarce or unavailable.

Conversely, scFoundation demonstrates specialized excellence in drug response prediction when sufficient training data is available, particularly in pooled-data scenarios where its comprehensive gene coverage and architectural optimization for transcriptomics data yield state-of-the-art performance. However, researchers should note that simple baseline models can sometimes outperform both scGPT and scFoundation for specific tasks like perturbation prediction [5] [14], highlighting the importance of task-specific evaluation before committing to computationally intensive approaches.

For optimal model selection, researchers should consider their specific data characteristics (modality, sample size), application context (known vs. novel perturbations), and computational resources. As the scFM field rapidly evolves, frameworks like BioLLM [17] and scDrugMap [21] provide essential standardized platforms for ongoing evaluation of these powerful but specialized tools in biological and clinical research.

In the rapidly evolving field of single-cell biology, foundation models like scGPT and scFoundation promise to revolutionize how we analyze cellular systems by learning universal patterns from massive datasets. However, their true capability and performance relative to each other and to simpler methods must be rigorously assessed using standardized evaluation metrics and frameworks. This comparison guide objectively examines the performance of scGPT and scFoundation within a broader benchmarking context, focusing on critical metrics such as Pearson correlation for perturbation prediction and integration metrics for data harmonization. Drawing on recent experimental evidence, we summarize quantitative data and detail methodological protocols to provide researchers, scientists, and drug development professionals with a clear, evidence-based resource for model selection.

Quantitative Performance Comparison

Independent benchmarking studies have consistently evaluated scGPT and scFoundation against various baseline models across multiple tasks and datasets. The tables below summarize key quantitative findings from these rigorous comparisons.

Table 1: Benchmarking performance of foundation models versus baselines on post-perturbation RNA-seq prediction (Pearson Delta metric)

| Model | Adamson | Norman | Replogle K562 | Replogle RPE1 |
| --- | --- | --- | --- | --- |
| Train Mean (baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest with GO features | 0.739 | 0.586 | 0.480 | 0.648 |
| Random Forest with scGPT embeddings | 0.727 | 0.583 | 0.421 | 0.635 |

Source: Adapted from [5]
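The Pearson Delta metric in the table correlates predicted and observed expression changes relative to control, rather than raw profiles. A minimal sketch over pseudo-bulk vectors (array contents illustrative):

```python
import numpy as np

def pearson_delta(pred, obs, ctrl):
    """Pearson correlation between predicted and observed expression
    deltas relative to the control pseudo-bulk profile."""
    d_pred = np.asarray(pred, float) - np.asarray(ctrl, float)
    d_obs = np.asarray(obs, float) - np.asarray(ctrl, float)
    # Correlate the two delta vectors across genes.
    return np.corrcoef(d_pred, d_obs)[0, 1]
```

Evaluating in delta space is what makes the metric informative: a model that simply reproduces the control profile has a zero (undefined-correlation) delta vector, so trivial "no change" predictions cannot score well.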

Table 2: Zero-shot performance comparison on cell type clustering (Average BIO Score)

| Model | Pancreas | Tabula Sapiens | Immune | PBMC (12k) |
| --- | --- | --- | --- | --- |
| HVG (baseline) | 0.614 | 0.582 | 0.601 | 0.592 |
| scVI | 0.598 | 0.565 | 0.578 | 0.561 |
| Harmony | 0.587 | 0.554 | 0.572 | 0.550 |
| scGPT | 0.532 | 0.521 | 0.525 | 0.581 |
| Geneformer | 0.448 | 0.432 | 0.441 | 0.445 |

Source: Adapted from [16]

Table 3: Key model architecture and training specifications

| Specification | scGPT | scFoundation |
| --- | --- | --- |
| Parameters | 53 million [42] | 100 million [5] |
| Pretraining dataset size | 33 million cells [10] [42] | 50 million cells [5] |
| Architecture | Transformer [42] | Transformer [4] |
| Gene embedding strategy | Value binning [42] | Value projection [12] |
| Primary pretraining task | Masked gene modeling [42] | Read-depth-aware masked gene modeling [4] |

Experimental Protocols and Methodologies

Benchmarking Framework for Perturbation Prediction

The evaluation of perturbation prediction capabilities follows a standardized protocol designed to assess model generalizability for unseen perturbations (Perturbation Exclusive or PEX setup) [5]. The core methodology involves:

  • Data Preparation: Using Perturb-seq datasets (e.g., Adamson, Norman, Replogle) generated via CRISPR-based perturbations (CRISPRi/CRISPRa) combined with single-cell sequencing. Data is partitioned to ensure that specific perturbations are held out from the training set for evaluation [5] [14].

  • Model Input Formulation:

    • scGPT: Uses RNA-seq vectors from unperturbed cells combined with a perturbation token added to the perturbed gene token [5].
    • scFoundation: Leverages pre-trained gene embeddings as inputs for graph neural-network architectures like GEARS [5].
  • Fine-tuning Protocol: Both foundation models are fine-tuned on the benchmark datasets according to their original publications' specifications before evaluation [5].

  • Evaluation Metrics:

    • Pearson Correlation: Calculated between predicted and ground truth pseudo-bulk expression profiles. This is performed in both raw gene expression space and, more importantly, in differential expression space (Pearson Delta) which measures the model's ability to capture expression changes relative to control [5].
    • Top DE Genes: Evaluation focused on the top 20 differentially expressed genes to assess capture of most significant transcriptional changes [5].
    • L2 Distance: The L2 distance between predicted and observed expression values, particularly for the most highly expressed or differentially expressed genes [14].
  • Baseline Models:

    • Simple Baselines: "No change" model (predicts control expression) and "additive" model (sums individual logarithmic fold changes for combinatorial perturbations) [14].
    • Train Mean: Predicts the average pseudo-bulk expression profile from the training dataset [5].
    • Traditional ML Models: Elastic-Net Regression, k-Nearest-Neighbors Regression, and Random Forest Regressor using biological features like Gene Ontology vectors or foundation model embeddings [5].
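The simple baselines listed above can be written in a few lines. The log-space additive rule for a double perturbation follows the description in [14]; array shapes and names are illustrative.

```python
import numpy as np

def no_change(ctrl_logexp):
    """'No change': predict the control log-expression profile as-is."""
    return np.asarray(ctrl_logexp, float).copy()

def train_mean(train_profiles):
    """'Train mean': average pseudo-bulk profile over training perturbations."""
    return np.mean(np.asarray(train_profiles, float), axis=0)

def additive(ctrl_logexp, logfc_a, logfc_b):
    """'Additive': for a combinatorial perturbation A+B, sum the two
    single-perturbation log fold changes onto the control profile."""
    return (np.asarray(ctrl_logexp, float)
            + np.asarray(logfc_a, float)
            + np.asarray(logfc_b, float))
```

That baselines this simple match or beat fine-tuned foundation models on the benchmarks above is the central cautionary finding of [5] and [14].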

Perturbation prediction benchmarking workflow (schematic): start benchmark → data preparation (Perturb-seq datasets) → model setup and fine-tuning → generate predictions → performance evaluation → comparison against baselines.

Figure 1: Workflow for perturbation prediction benchmarking

Zero-Shot Evaluation Framework

The assessment of zero-shot capabilities focuses on model performance without task-specific fine-tuning, which is critical for exploratory biological applications where labels are unknown [16]. The methodology includes:

  • Embedding Extraction: Generating cell embeddings from pre-trained foundation models without additional fine-tuning [16].

  • Cell Type Clustering Task:

    • Procedure: Applying foundation model embeddings to separate known cell types across multiple datasets [16].
    • Evaluation Metrics: Average BIO score and Average Silhouette Width (ASW) to quantify clustering quality [16].
    • Baselines: Comparison against established methods including Highly Variable Genes (HVG) selection, Harmony, and scVI [16].
  • Batch Integration Task:

    • Objective: Assess the model's ability to remove technical batch effects while preserving biological variation [16].
    • Datasets: Using benchmark datasets with known batch effects, such as the Pancreas dataset comprising data from five different sources [16].
    • Evaluation: Qualitative visualization and quantitative metrics for batch mixing and principal component regression (PCR) scores [16].
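The Average Silhouette Width used above averages per-cell silhouette scores computed from embedding distances. A minimal NumPy implementation (Euclidean distance, no optimizations, singleton clusters scored 0 by convention):

```python
import numpy as np

def average_silhouette_width(X, labels):
    """Mean silhouette width over all points.

    For each point: a = mean distance to its own cluster (excluding
    itself), b = smallest mean distance to any other cluster,
    s = (b - a) / max(a, b). Values near 1 indicate tight,
    well-separated clusters.
    """
    X, labels = np.asarray(X, float), np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i, lab in enumerate(labels):
        same = labels == lab
        n_same = same.sum()
        if n_same < 2:
            scores.append(0.0)  # singleton cluster convention
            continue
        # dists[i, i] is 0, so dividing by n_same - 1 averages over others.
        a = dists[i, same].sum() / (n_same - 1)
        b = min(dists[i, labels == other].mean()
                for other in set(labels.tolist()) if other != lab)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

When labels are cell types, this rewards embeddings that keep types compact and separated; when labels are batches, low values are desirable, which is how the same statistic serves both the bio and batch sides of the benchmark.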

Signaling Pathways and Biological Interpretation

The biological relevance of foundation models can be assessed by examining how well their internal representations align with established biological knowledge, particularly in the context of gene regulatory networks and signaling pathways.

  • Gene Embedding Analysis: Studies have compared the similarity of gene embeddings from foundation models against known biological relationships, including shared biological pathways (KEGG, REACTOME) and gene regulatory networks (CollecTRI) [5]. Random Forest models using biological prior knowledge (Gene Ontology vectors) consistently outperform foundation models, suggesting limitations in the biological meaningfulness of the learned representations [5].

  • Genetic Interaction Prediction: Models are evaluated on their ability to predict non-additive genetic interactions, categorized as "buffering," "synergistic," or "opposite" effects [14]. Current foundation models predominantly predict buffering interactions and rarely correctly identify synergistic interactions [14].
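A per-gene interaction call compares the observed combinatorial effect with the additive expectation. The rule and threshold below are an illustrative heuristic, not the precise definition used in [14].

```python
def classify_interaction(obs_lfc, exp_lfc, tol=0.1):
    """Heuristic genetic-interaction call for one gene.

    obs_lfc: observed log fold change of the double perturbation.
    exp_lfc: additive expectation (sum of single-perturbation LFCs).
    tol: slack before calling a deviation (illustrative threshold).
    """
    if exp_lfc * obs_lfc < 0:
        return "opposite"      # effect flips sign vs expectation
    if abs(obs_lfc) > abs(exp_lfc) + tol:
        return "synergistic"   # stronger than additive
    if abs(obs_lfc) < abs(exp_lfc) - tol:
        return "buffering"     # weaker than additive
    return "additive"
```

Under a scheme like this, a model that hedges toward small predicted changes will emit mostly "buffering" calls, consistent with the behavior [14] reports for current foundation models.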

Evaluation framework (schematic): model evaluation branches into perturbation prediction (primary metrics: Pearson Delta, L2 distance), zero-shot performance (primary metrics: AvgBIO score, ASW), and biological relevance (primary metrics: pathway alignment, interaction prediction).

Figure 2: Comprehensive evaluation framework for foundation models

The Scientist's Toolkit

Table 4: Essential research reagents and computational resources for foundation model benchmarking

| Resource | Type | Function in Benchmarking |
| --- | --- | --- |
| Perturb-seq datasets (Adamson, Norman, Replogle) | Biological data | Provide ground truth measurements of post-perturbation gene expression profiles for model training and evaluation [5] [14] |
| Gene Ontology (GO) vectors | Biological knowledge base | Serve as biologically meaningful features for baseline machine learning models [5] |
| CRISPR interference (CRISPRi) | Molecular tool | Enables precise genetic perturbations in experimental datasets used for benchmarking [5] |
| Harmony | Computational method | Established baseline for batch integration tasks in zero-shot evaluations [16] |
| scVI | Computational method | Generative model used as a baseline for batch correction and data integration [16] |
| Random Forest regressor | Machine learning algorithm | Provides a simple yet strong baseline when equipped with biological features [5] |
| Highly variable genes (HVG) | Feature selection method | Standard approach for selecting informative genes, used as a competitive baseline [16] |

The comprehensive benchmarking of scGPT and scFoundation reveals a nuanced performance landscape. For perturbation prediction, both foundation models are consistently outperformed by simpler baseline approaches, with the Train Mean baseline exceeding scGPT and scFoundation across all four benchmark datasets, and Random Forest models using biological features achieving superior Pearson Delta metrics [5]. In zero-shot evaluation for cell type annotation and batch integration, both models demonstrate limitations, with scGPT showing variable performance across datasets and Geneformer consistently underperforming established methods [16].

These findings highlight critical considerations for researchers and drug development professionals: foundation models present promising frameworks but have not yet consistently surpassed simpler, more interpretable methods in key tasks. Model selection should therefore be guided by specific task requirements, dataset characteristics, and available computational resources, rather than assuming superior performance from more complex foundation architectures. Future development should focus on improving the biological meaningfulness of learned representations and enhancing zero-shot capabilities for truly exploratory biological discovery.

Conclusion

The benchmarking of scGPT and scFoundation reveals a nuanced landscape where neither model universally dominates. scFoundation demonstrates superior performance in specific, well-defined tasks like drug response prediction, while scGPT shows stronger generalization capabilities in cross-data and zero-shot settings. Critical limitations identified include inconsistent zero-shot performance, vulnerability to batch effects, and surprising underperformance against simpler models using biological prior knowledge. These findings underscore that model selection must be task-specific and highlight the urgent need for more robust benchmarking datasets and standardized evaluation frameworks. Future development should focus on improving pretraining objectives for better zero-shot generalization, creating more challenging benchmarks, and developing hybrid approaches that combine the strengths of foundation models with established biological knowledge. For biomedical research, the strategic integration of these models holds significant potential to accelerate drug discovery and personalized medicine, provided their current limitations are acknowledged and addressed.

References