This article provides a systematic evaluation of two leading single-cell foundation models, scGPT and scFoundation, based on the latest benchmarking studies. It explores their foundational concepts and architectures, examines their methodological applications in tasks like drug response prediction and perturbation modeling, identifies key performance limitations and optimization strategies, and delivers a rigorous comparative analysis across multiple biological contexts. Aimed at researchers, scientists, and drug development professionals, this review synthesizes critical insights to guide model selection and application, highlighting current challenges and future directions for integrating AI into biomedical research.
Single-cell Foundation Models (scFMs) are large-scale deep learning models, typically based on transformer architectures, that are pretrained on vast datasets of single-cell RNA sequencing (scRNA-seq) data [1]. The core concept draws an analogy from natural language processing: treating a cell as a "sentence" and its constituent genes as "words" [1] [2]. By training on millions of cells across diverse tissues, conditions, and species, these models aim to learn fundamental principles of cellular biology and gene-gene interactions in a self-supervised manner [3] [1]. This pretraining allows scFMs to develop rich, internal representations of biological knowledge, which can then be adapted—or fine-tuned—for a wide array of downstream tasks without the need to train a new model from scratch for each specific application [1].
The emergence of scFMs addresses critical challenges in single-cell data analysis, including the characteristically high sparsity, high dimensionality, and technical noise of transcriptome data [3] [4]. They offer a promising unified framework for integrating and comprehensively analyzing the rapidly expanding repositories of single-cell data [1]. Two prominent examples of such models are scGPT and scFoundation, which have been the subject of extensive benchmarking studies to evaluate their respective strengths and limitations [3] [5] [6].
Comprehensive benchmarking reveals that no single scFM consistently outperforms all others across every possible task [3] [6]. Model performance is highly dependent on the specific downstream application, dataset size, and the biological question being asked. The following tables summarize the comparative performance of scGPT and scFoundation across key biological tasks, based on recent, rigorous evaluations.
Table 1: Performance Comparison on Cell-Level Tasks
| Task | Description | scGPT Performance | scFoundation Performance | Key Findings |
|---|---|---|---|---|
| Cell Type Annotation | Classifying cell identity from gene expression. | Superior in zero-shot settings; achieves better cell type separation in embeddings [6]. | Competitive, but generally outperformed by scGPT in independent benchmarks [6]. | scGPT's architecture is particularly proficient at preserving biologically relevant information, enhancing cell type clustering [6]. |
| Batch Integration | Correcting for technical variations between datasets. | Superior at removing batch effects while preserving biological variation in zero-shot tasks [3] [6]. | Effective at distinguishing certain cell types, but generally less effective at batch correction than scGPT [6]. | A unified framework found scGPT outperformed other models, including scFoundation, on batch-effect-removal metrics [6]. |
| Cancer Cell Identification | Identifying malignant cells within a tumor microenvironment. | Robust and versatile performance across diverse applications and cancer types [3]. | Robust and versatile performance across diverse applications and cancer types [3]. | Both models demonstrated utility in this clinically relevant task, with no single model being a clear winner in all contexts [3]. |
Table 2: Performance Comparison on Gene-Level and Perturbation Tasks
| Task | Description | scGPT Performance | scFoundation Performance | Key Findings |
|---|---|---|---|---|
| Perturbation Prediction | Predicting gene expression changes after a genetic or chemical intervention. | Underperformed compared to simpler baseline models (e.g., Random Forest with GO features) [5]. | Underperformed compared to simpler baseline models, including the "Train Mean" baseline [5]. | A key study found that even the simplest baseline model (predicting the mean of training data) could outperform these foundation models on certain Perturb-seq benchmarks [5]. |
| Gene Function Prediction | Inferring gene function and relationships from embeddings. | Strong capabilities, benefiting from effective pretraining strategies [6]. | Strong capabilities in gene-level tasks [6]. | Both models automatically learn a gene embedding matrix that can be leveraged for predicting biological relationships [3] [6]. |
The performance data presented in the previous section are derived from standardized benchmarking frameworks designed to ensure a fair and rigorous comparison. The core methodology involves a "zero-shot" or "fine-tuning" evaluation of the model's learned representations on specific, held-out downstream tasks [3] [6].
A typical benchmarking pipeline involves several critical stages:
Benchmarking studies employ a diverse set of metrics to holistically assess model performance [3]:
The diagram below illustrates the standard lifecycle of a single-cell Foundation Model, from pretraining on large-scale data to application on downstream biological tasks.
Lifecycle of a Single-Cell Foundation Model
Successfully applying and benchmarking scFMs requires a combination of computational tools, software frameworks, and curated biological data resources. The following table details key components of the modern computational biologist's toolkit for working with models like scGPT and scFoundation.
Table 3: Essential Resources for scFM Research
| Category | Item / Tool | Function & Description |
|---|---|---|
| Software & Frameworks | BioLLM | A unified framework that standardizes the deployment of various scFMs (like scGPT and scFoundation) through consistent APIs, enabling seamless model switching and comparative benchmarking [6]. |
| Data Resources | CZ CELLxGENE | A curated atlas and database that provides unified access to millions of annotated single-cell datasets, often used for model pretraining and as a source of high-quality, independent validation data [3] [1]. |
| Data Resources | Perturb-seq Datasets | High-throughput single-cell datasets combining CRISPR-based genetic perturbations with sequencing. They serve as the primary benchmark for evaluating a model's ability to predict cellular responses to genetic interventions [5]. |
| Baseline Models | Traditional ML Models (e.g., RF, kNN) | Simple machine learning models like Random Forest (RF) and k-Nearest Neighbors (kNN) are used as critical baselines. They help determine if the complexity of a foundation model provides a tangible performance benefit for a given task [3] [5]. |
| Evaluation Metrics | Cell Ontology-Informed Metrics (e.g., LCAD) | Novel metrics that incorporate prior biological knowledge from cell ontologies to assess whether model errors are biologically reasonable (e.g., misclassifying a T-cell as a B-cell is less severe than misclassifying it as a neuron) [3] [4]. |
| Gene Embedding Baselines | Functional Representation of Gene Signatures (FRoGS) | An alternative method for generating gene embeddings via random walks on a biological hypergraph. Used as a baseline to evaluate the quality of gene representations learned by scFMs [3]. |
Single-cell RNA sequencing (scRNA-seq) has revolutionized biology by allowing researchers to probe cellular heterogeneity at an unprecedented resolution. The emergence of single-cell foundation models (scFMs), inspired by breakthroughs in large language models (LLMs), represents a paradigm shift in how this complex data is analyzed. These models, pre-trained on massive collections of single-cell data, aim to learn universal patterns of cellular biology that can be adapted to diverse downstream tasks. Among these, scGPT and scFoundation have emerged as prominent transformer-based models. This guide provides an objective comparison of their performance, underpinned by experimental data from recent benchmarking studies, to inform researchers and drug development professionals about their respective strengths and limitations.
The design philosophies of scGPT and scFoundation, while both rooted in transformer architecture, differ in ways that influence their capabilities and performance.
scGPT leverages a generative pre-trained transformer architecture, specifically designed to handle the non-sequential nature of gene expression data [1] [7]. Its input processing creates a composite embedding for each gene by combining its identity (a unique gene token) and its expression value (often binned into discrete values) [8]. A key innovation is its use of a specialized attention mask within its transformer blocks, which allows for generative pre-training on gene expression profiles without relying on a fixed gene order [2] [7]. scGPT was pre-trained on a massive corpus of over 33 million human cells from 51 organs and 441 studies, collated from the CELLxGENE collection [9] [10] [2].
scFoundation, in contrast, employs an asymmetric encoder-decoder architecture [4]. It is designed to process a much larger input gene set, encompassing nearly all ~19,000 human protein-encoding genes along with common mitochondrial genes [5] [4]. Its pre-training strategy incorporates a read-depth-aware masked gene modeling (MGM) objective, using a mean squared error (MSE) loss to reconstruct masked gene expressions [4].
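The masked gene modeling objective can be illustrated with a minimal numpy sketch. This is a rough stand-in only: the transformer itself and scFoundation's read-depth-aware conditioning are omitted, and a per-gene training mean plays the role of the predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: 4 cells x 10 genes (log-normalized values).
expr = rng.gamma(shape=2.0, scale=1.0, size=(4, 10))

# Randomly mask ~30% of gene positions, as in masked gene modeling.
mask = rng.random(expr.shape) < 0.3

# A real model would reconstruct the masked values from the unmasked
# context; here a per-gene mean stands in for the predictor.
pred = np.broadcast_to(expr.mean(axis=0), expr.shape)

# MSE loss over masked positions only, mirroring the MGM objective
# (the read-depth conditioning is omitted in this sketch).
mse = float(np.mean((pred[mask] - expr[mask]) ** 2))
```

The key point is that the loss is computed only at the masked positions, forcing the model to infer hidden expression values from co-expressed genes rather than simply copying its input.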
Table: Architectural Comparison of scGPT and scFoundation
| Feature | scGPT | scFoundation |
|---|---|---|
| Core Architecture | GPT-like (Decoder-based) | Asymmetric Encoder-Decoder |
| Model Parameters | ~50 million [4] | ~100 million [4] |
| Pre-training Dataset Size | ~33 million cells [4] [10] | ~50 million cells [4] |
| Input Gene Handling | ~1,200 Highly Variable Genes (HVGs) [4] | ~19,264 protein-encoding genes [4] |
| Value Embedding | Expression value binning [8] | Value projection [4] |
| Positional Embedding | Not used [4] | Not used [4] |
The evaluation of scFMs like scGPT and scFoundation follows a structured pipeline to ensure fair and informative comparisons. The following diagram visualizes a typical benchmarking workflow as implemented in frameworks like BioLLM [4] [6].
Diagram: Benchmarking Workflow for Single-Cell Foundation Models
A primary application of scFMs is to generate meaningful representations (embeddings) of cells that capture biological state, which is crucial for tasks like cell type annotation and batch integration.
In a comprehensive benchmark by BioLLM, which evaluated zero-shot cell embeddings using metrics like Average Silhouette Width (ASW) to measure cluster purity, scGPT consistently outperformed other models, including scFoundation [6]. scGPT's embeddings provided superior separation of cell types in visualizations and demonstrated greater effectiveness in integrating data across batches, though it, like other models, struggled to correct for strong batch effects across different sequencing technologies [6]. Another independent study confirmed that fine-tuned scGPT outperformed Geneformer in cell type annotation, though it noted that inconsistent results across studies highlight the importance of proper adaptation techniques like Parameter-Efficient Fine-Tuning (PEFT) [8].
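The ASW metric used in this benchmark can be illustrated with a toy example. This sketch assumes scikit-learn is available; the embeddings and labels below are synthetic stand-ins for scFM cell embeddings and cell-type annotations.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Synthetic "cell embeddings": two separated clusters standing in for
# two cell types; real benchmarks use the model's embedding output.
type_a = rng.normal(loc=0.0, scale=0.3, size=(50, 8))
type_b = rng.normal(loc=3.0, scale=0.3, size=(50, 8))
embeddings = np.vstack([type_a, type_b])
labels = np.array([0] * 50 + [1] * 50)

# ASW lies in [-1, 1]; higher means purer cell-type clusters.
# Benchmarks often rescale it to [0, 1] via (asw + 1) / 2.
asw = silhouette_score(embeddings, labels)
asw_scaled = (asw + 1) / 2
```

Computed against cell-type labels, a high ASW rewards biological separation; the same statistic computed against batch labels is typically inverted to reward batch mixing.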
Predicting cellular transcriptional responses to genetic perturbations is a rigorous test of a model's grasp of gene regulatory mechanics. A dedicated benchmarking study yielded surprising results [5].
The study evaluated models on their ability to predict post-perturbation gene expression profiles (in differential expression space) across four Perturb-seq datasets. The results demonstrated that even a simple baseline model (Train Mean), which predicts the average expression profile from the training data, could outperform the fine-tuned foundation models. More notably, a Random Forest (RF) regressor using prior biological knowledge like Gene Ontology (GO) vectors outperformed scGPT by a large margin [5].
Table: Performance in Perturbation Prediction (Pearson Correlation in Differential Expression Space)
| Model | Adamson Dataset | Norman Dataset | Replogle (K562) | Replogle (RPE1) |
|---|---|---|---|---|
| Train Mean (Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT (Fine-tuned) | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation (Fine-tuned) | 0.552 | 0.459 | 0.269 | 0.471 |
| RF with GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| RF with scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |
An important finding was that using the pre-trained embeddings from scGPT as features for a Random Forest model led to better performance than using the fine-tuned scGPT model itself, though it still fell short of the RF model with GO features [5]. This suggests that while scGPT's embeddings contain valuable biological information, the full fine-tuning pipeline may not be leveraging it optimally for this specific task.
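The "embeddings as features" setup can be sketched as follows, assuming scikit-learn. The embeddings and expression changes here are synthetic stand-ins, not real scGPT outputs; the point is the shape of the pipeline, not the numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n_perts, emb_dim, n_genes = 40, 16, 30

# Stand-ins: one pretrained gene embedding per perturbed gene (rows)
# and the differential-expression profile each perturbation induced.
pert_embeddings = rng.normal(size=(n_perts, emb_dim))
delta_expr = pert_embeddings @ rng.normal(size=(emb_dim, n_genes))

# Perturbation-exclusive split: train on seen perturbations, predict
# profiles for entirely unseen ones from their embeddings alone.
train, test = slice(0, 30), slice(30, 40)
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(pert_embeddings[train], delta_expr[train])
pred = rf.predict(pert_embeddings[test])
```

Because the regressor sees only the embedding of the perturbed gene, any predictive power it has comes from the biological information encoded in those representations.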
The application of scFMs to predict patient-specific or cell-specific responses to drugs is highly relevant for therapeutic development. The scDrugMap benchmark provides insights here [11].
In pooled-data evaluation (training and testing on mixed datasets), scFoundation achieved the best performance, with a mean F1 score of 0.971, outperforming other models by a significant margin [11]. However, in the more challenging cross-data evaluation (testing on datasets not seen during training), which better assesses model generalizability, scGPT excelled in zero-shot learning (mean F1: 0.858), while another model, UCE, performed best after fine-tuning [11]. This indicates a trade-off: scFoundation may achieve higher peak performance on familiar data distributions, while scGPT shows stronger inherent generalization in some contexts.
Benchmarking studies rely on a suite of computational tools and data resources. The table below details key components used in the evaluations discussed.
Table: Key Reagents for Single-Cell Foundation Model Research
| Reagent / Resource | Type | Function in Research |
|---|---|---|
| Perturb-seq Datasets [5] | Experimental Data | Provides ground-truth data (genetic perturbation + scRNA-seq) for evaluating model predictions of causal cellular responses. |
| CELLxGENE Atlas [9] [2] | Data Repository | A primary source of millions of curated single-cell datasets used for pre-training and as a reference for model applications. |
| BioLLM Framework [6] | Software Tool | A unified framework that standardizes the integration, fine-tuning, and evaluation of different scFMs, ensuring fair comparisons. |
| Gene Ontology (GO) Vectors [5] | Prior Knowledge | Structured, biologically grounded feature sets used to build powerful baseline models for tasks like perturbation prediction. |
| Parameter-Efficient Fine-Tuning (PEFT) [8] | Computational Method | Adaptation techniques like LoRA that efficiently tailor large scFMs to new tasks, reducing computational cost and catastrophic forgetting. |
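The low-rank idea behind LoRA, listed in the table above, can be sketched in a few lines of numpy. Dimensions and initialization are illustrative only; real PEFT applies this update inside each attention layer of the frozen model.

```python
import numpy as np

rng = np.random.default_rng(6)
d_in, d_out, rank = 64, 64, 4

# Frozen pretrained weight from the foundation model (stand-in values).
W = rng.normal(size=(d_in, d_out))

# LoRA trains only a low-rank update A @ B. Initializing B to zero
# means the adapted layer starts out identical to the pretrained one.
A = rng.normal(scale=0.01, size=(d_in, rank))
B = np.zeros((rank, d_out))

def adapted_forward(x):
    # Effective weight is W + A @ B; only A and B would get gradients.
    return x @ (W + A @ B)

x = rng.normal(size=(2, d_in))
# Trainable parameters: 64*4 + 4*64 = 512, vs. 4096 frozen in W.
n_trainable = A.size + B.size
```

Training 512 parameters instead of 4,096 per layer is what makes adapting a 50M- or 100M-parameter scFM tractable on modest hardware, and keeping W frozen limits catastrophic forgetting.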
The benchmarking data reveals that the competition between scGPT and scFoundation is not a simple matter of one being universally superior. Instead, each model demonstrates distinct strengths, a finding consistent with a broader benchmark concluding that "no single scFM consistently outperforms others across all tasks" [4].
A critical insight from the perturbation prediction benchmarks is that foundation models do not always outperform simpler, biologically-informed baseline models [5]. This highlights the necessity of including such baselines in evaluations to properly assess the value added by these large-scale models.
For researchers, the choice between scGPT, scFoundation, or a simpler alternative should be guided by the specific downstream task, dataset size, available computational resources, and the need for biological interpretability. As the field matures, standardized frameworks like BioLLM and continued rigorous benchmarking will be essential for guiding the effective application of these powerful tools in biological discovery and drug development.
The development of single-cell foundation models (scFMs) represents a significant push in computational biology, aiming to leverage large-scale datasets to build models that can generalize across diverse biological tasks. Among these, scFoundation is a prominent model that utilizes an asymmetric transformer architecture and was pre-trained on approximately 50 million human cells [12] [13] [4]. This guide objectively situates scFoundation's performance within the competitive landscape of single-cell foundation models, focusing on direct comparisons with alternatives like scGPT, Geneformer, and UCE, based on recent, rigorous benchmarking studies.
The performance of any foundation model is fundamentally shaped by its architectural choices and the scale of its training data.
scFoundation: Employs an asymmetric encoder-decoder transformer architecture and is categorized as a value projection-based model [4]. This approach aims to preserve the full resolution of gene expression data by directly predicting raw expression values. Its pretraining was conducted on a corpus of around 50 million human cells, resulting in a model with ~100 million parameters [12] [4]. The pretraining task was a read-depth-aware masked gene modeling (MGM) objective, optimized using a Mean Squared Error (MSE) loss [4].
scGPT: In contrast, scGPT is a value binning-based model: it segments continuous gene expression values into discrete bins before embedding them, and uses transformer blocks with a specialized attention mask mechanism [13] [4]. It was pretrained on over 33 million human cells (non-cancerous) and has ~50 million parameters [12] [4]. Its pretraining combines generative objectives with iterative MGM.
Geneformer: This model adopts a different strategy, based on the ordering of genes by expression level [13] [4]. It is a rank-based model that learns by predicting gene positions within a cell's context. Geneformer was trained on 30 million cells from humans and mice and has a smaller architecture with 40 million parameters [13] [4]. Its pretraining uses MGM with a Cross-Entropy (CE) loss for gene identity prediction.
UCE (Universal Cell Embedding): Distinguished by its use of protein language model embeddings from ESM-2 as gene representations, UCE is a massive model with 650 million parameters [14] [4]. It was pretrained on 36 million cells and uses a modified MGM task with a binary cross-entropy loss to predict whether a gene is expressed or not [4].
The following diagram summarizes the core pretraining workflow common to these models, highlighting key steps like tokenization and the masked gene modeling objective.
Independent benchmarks have revealed that no single model consistently dominates across all tasks. The table below summarizes the comparative performance of scFoundation against its peers in several critical applications.
| Task | Top Performing Model(s) | scFoundation's Performance & Notes |
|---|---|---|
| Perturbation Response Prediction | Random Forest with GO features, Additive baseline model [5] [14] | Underperformed against a simple baseline that predicts the mean of training data [5] [14]. |
| Drug Response Prediction | scFoundation (pooled-data), UCE (cross-data) [11] | Achieved top performance (mean F1: 0.971) when data is pooled; less dominant in cross-data settings [11]. |
| Zero-Shot Cell Type Clustering | HVG selection, scVI, Harmony [15] [16] | Not among top performers; simpler methods like Highly Variable Genes (HVG) selection outperformed foundation models [15] [16]. |
| Zero-Shot Batch Integration | HVG selection, scVI, Harmony [15] [16] | Not among top performers. Geneformer consistently ranked last, while scGPT showed mixed results [15] [16]. |
| Gene Function Prediction | CellFM, scGPT [12] [17] | CellFM, a newer model, reported improvements. scGPT also showed strong capabilities [12] [17]. |
Predicting a cell's transcriptomic response to a genetic perturbation is a key test for a model's understanding of regulatory biology. Recent benchmarks have yielded critical insights.
Experimental Protocol [5] [14]:
The logical flow of this benchmarking process is outlined below.
The following table details key resources and computational tools referenced in the benchmarking of single-cell foundation models.
| Research Reagent / Resource | Function in Evaluation |
|---|---|
| Perturb-seq Datasets (Adamson, Norman, Replogle) | Provides ground-truth experimental data for benchmarking genetic perturbation prediction models [5] [14]. |
| Gene Ontology (GO) Annotations | A source of biologically meaningful features used in simple baseline models (e.g., Random Forest) to compete against foundation models [5]. |
| BioLLM Framework | A unified software framework that provides standardized APIs for integrating and evaluating different scFMs, ensuring fair comparisons [17]. |
| Highly Variable Genes (HVG) | A simple, traditional feature selection method in single-cell analysis that serves as a strong baseline in zero-shot tasks like clustering and batch correction [15] [16]. |
| Harmony & scVI | Established, specialized algorithms for single-cell data integration (batch correction) and analysis. Used as baseline benchmarks for cell-level tasks [15] [4] [16]. |
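The HVG baseline listed above is simple enough to sketch directly. This is a simplified numpy version of the dispersion-based selection that tools like scanpy implement; the counts are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic counts: 200 cells x 500 genes, with 20 genes given extra
# cell-to-cell variability so the selection has real signal to find.
counts = rng.poisson(lam=2.0, size=(200, 500)).astype(float)
noisy_genes = rng.choice(500, size=20, replace=False)
counts[:, noisy_genes] *= rng.gamma(2.0, 1.0, size=(200, 20))

# Rank genes by dispersion (variance / mean) on log-normalized data
# and keep the top 50 as the "highly variable" feature set.
logged = np.log1p(counts)
mean = logged.mean(axis=0)
disp = logged.var(axis=0) / np.maximum(mean, 1e-12)
top_hvg = np.argsort(disp)[::-1][:50]

# Downstream clustering or integration then uses only these columns.
reduced = logged[:, top_hvg]
```

That a feature filter this simple can rival pretrained foundation models in zero-shot clustering is precisely why the benchmarks insist on including it.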
In conclusion, benchmarking studies reveal a nuanced picture of the current capabilities of scFoundation and its peers. While these models represent a significant architectural achievement, they do not consistently outperform simpler, often biologically-informed, baseline methods on critical tasks like perturbation prediction and zero-shot analysis [5] [14] [15]. This indicates that the goal of a generalized, out-of-the-box foundation model that fully captures the complexity of cellular biology remains an active challenge.
The choice of model is highly task-dependent. For instance, while scFoundation excelled in one drug response prediction benchmark [11], it was less competitive in perturbation prediction [5] [14]. The field is maturing with the development of standardized evaluation frameworks like BioLLM [17], which will be crucial for guiding future development. The path forward likely involves not only scaling model and dataset size but also more effectively integrating prior biological knowledge to build models that offer robust, generalizable, and biologically plausible predictions.
In single-cell RNA sequencing (scRNA-seq) data, genes do not possess a natural sequential order, unlike words in a sentence. This fundamental difference presents a significant challenge for applying transformer-based architectures, which were originally designed for sequentially ordered text. Treating a cell's gene expression profile as a "sentence" requires researchers to impose an artificial sequence, a process known as tokenization. How different foundation models approach this tokenization problem directly impacts their ability to capture biological relationships and predict cellular behavior.
This guide objectively compares the performance of two prominent single-cell foundation models—scGPT and scFoundation—within the broader context of benchmarking research. By examining their tokenization strategies, architectural implementations, and experimental outcomes, we provide researchers and drug development professionals with critical insights for model selection in biological discovery and therapeutic applications.
Table 1: Fundamental Characteristics of scGPT and scFoundation
| Characteristic | scGPT | scFoundation |
|---|---|---|
| Primary Architecture | GPT-like decoder | Asymmetric encoder-decoder |
| Model Parameters | ~50 million | ~100 million |
| Pretraining Dataset Size | ~33 million cells | ~50 million cells |
| Input Gene Capacity | 1,200 highly variable genes (HVGs) | 19,264 protein-encoding genes |
| Value Representation | Value binning | Value projection |
| Positional Embedding | Not used | Not used |
| Gene Symbol Embedding | Lookup Table (512 dimensions) | Lookup Table (768 dimensions) |
Tokenization strategies differ markedly between models, significantly influencing their biological representations:
scGPT employs a highly variable gene selection approach, focusing computational resources on 1,200 genes with the most variable expression across cells. It uses value binning to transform continuous expression values into discrete tokens and does not incorporate positional embeddings, treating the gene set as a bag-of-words rather than an ordered sequence [4].
scFoundation utilizes a comprehensive gene representation, incorporating nearly all protein-encoding genes. This provides a more complete biological picture but increases computational complexity. Like scGPT, it foregoes positional embeddings, instead using a value projection system to handle continuous expression data [5] [4].
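The two value-handling strategies can be contrasted in a toy sketch. The bin edges and the projection matrix below are illustrative, not the models' actual parameters.

```python
import numpy as np

# Toy expression vector for one cell (log-normalized values).
expr = np.array([0.0, 0.4, 1.1, 2.7, 5.3, 0.0, 3.2])

# scGPT-style value binning: nonzero values are discretized into a
# fixed number of bins; zeros keep a dedicated token (0 here).
n_bins = 5
edges = np.quantile(expr[expr > 0], np.linspace(0, 1, n_bins + 1)[1:-1])
binned = np.where(expr > 0, np.digitize(expr, edges) + 1, 0)

# scFoundation-style value projection: the continuous value is mapped
# through a (learned) projection into the embedding space instead.
rng = np.random.default_rng(4)
proj = rng.normal(size=(1, 8))        # scalar value -> 8-dim embedding
value_emb = expr[:, None] @ proj       # shape (7, 8)
```

Binning yields a discrete vocabulary that fits naturally into token-based transformers at the cost of resolution; projection preserves the continuous signal but forgoes that discrete structure.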
Benchmarking studies employed standardized experimental protocols to evaluate model performance on predicting transcriptomic changes after genetic perturbations:
Datasets: Models were evaluated on multiple Perturb-seq datasets, including Adamson (68,603 cells with CRISPRi), Norman (91,205 cells with CRISPRa), and Replogle (K562 and RPE1 cell lines, ~162,000 cells each) [5].
Training Setup: Foundation models were fine-tuned according to authors' specifications using a perturbation-exclusive (PEX) setup, where models were evaluated on their ability to predict effects of completely unseen perturbations [5].
Baseline Models: Simple baseline models including Train Mean (predicting average expression from training data), Elastic-Net Regression, k-Nearest Neighbors, and Random Forest regressors were implemented for comparison [5].
Evaluation Metrics: Performance was assessed using Pearson correlation in differential expression space (Pearson Delta) and accuracy in predicting top 20 differentially expressed genes, with pseudo-bulk profiles created by averaging single-cell predictions [5].
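The Pearson Delta metric and pseudo-bulk averaging described above can be sketched with synthetic numbers; the profiles are simulated, and the 0.8 factor is an arbitrary "partial recovery" strength for the hypothetical prediction.

```python
import numpy as np

rng = np.random.default_rng(5)
n_genes = 50

# Synthetic setup: a control profile and a true perturbation effect.
control = rng.gamma(2.0, 1.0, size=n_genes)
effect = rng.normal(0.0, 0.5, size=n_genes)

# Pseudo-bulk ground truth: average 100 simulated single cells drawn
# around the true post-perturbation profile.
cells = (control + effect) + rng.normal(0.0, 0.3, size=(100, n_genes))
true_post = cells.mean(axis=0)

# A hypothetical prediction that partially recovers the effect.
pred_post = control + 0.8 * effect + rng.normal(0.0, 0.2, size=n_genes)

# Pearson Delta: correlate predicted vs. observed *changes* relative
# to control (differential expression space), not raw expression.
delta_true = true_post - control
delta_pred = pred_post - control
pearson_delta = np.corrcoef(delta_pred, delta_true)[0, 1]
```

Working in differential expression space matters: raw post-perturbation profiles correlate highly with control regardless of model quality, so subtracting the control isolates the perturbation-specific signal the model is actually asked to predict.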
Table 2: Performance Comparison on Perturbation Prediction (Pearson Delta)
| Model | Adamson | Norman | Replogle K562 | Replogle RPE1 |
|---|---|---|---|---|
| Train Mean (Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest + GO | 0.739 | 0.586 | 0.480 | 0.648 |
| Random Forest + scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |
Performance Gap: Simple baseline models consistently outperformed both foundation models across all datasets. The Train Mean baseline achieved superior Pearson Delta correlation values (0.711, 0.557, 0.373, 0.628) compared to scGPT (0.641, 0.554, 0.327, 0.596) and scFoundation (0.552, 0.459, 0.269, 0.471) across the four benchmark datasets respectively [5].
Biological Prior Knowledge Integration: Random Forest models incorporating Gene Ontology (GO) features substantially outperformed all foundation models (0.739, 0.586, 0.480, 0.648), suggesting that explicit biological knowledge may be more valuable than representations learned through pretraining [5].
Embedding Utility: When scGPT's pretrained gene embeddings were used as features in Random Forest models, performance improved over the fine-tuned scGPT model itself (0.727 vs. 0.641 on Adamson dataset), indicating that the embeddings capture biologically relevant information that may be underutilized by scGPT's native architecture [5].
Evaluation Setting: Models were assessed without any task-specific fine-tuning to measure the generalizable biological knowledge acquired during pretraining [15].
Tasks: Cell type clustering and batch integration across multiple datasets including Tabula Sapiens, Pancreas, and PBMC datasets [15].
Baselines: Compared against established methods including Highly Variable Genes (HVG) selection, Harmony, and scVI [15].
Metrics: Average BIO score (AvgBio) for clustering quality and batch integration metrics (PCR) for technical variation removal [15].
Table 3: Zero-Shot Performance Comparison Across Tasks
| Model | Cell Type Clustering (AvgBio) | Batch Integration (PCR) | Generalization to Unseen Data |
|---|---|---|---|
| scGPT | Variable, outperformed by HVG and scVI on most datasets | Moderate, outperforms Harmony and scVI on some complex datasets | Inconsistent, no clear advantage over baselines |
| scFoundation | Not fully evaluated in zero-shot setting | Not fully evaluated in zero-shot setting | Limited evaluation available |
| Geneformer | Consistently outperformed by all baselines | Poor, shows inadequate batch mixing | Fails to generalize effectively |
| HVG (Baseline) | Best performing across most datasets | Best batch integration scores | Consistent performance across datasets |
Cell Type Clustering: In zero-shot settings, both scGPT and Geneformer were generally outperformed by simple Highly Variable Genes selection and established methods like Harmony and scVI across multiple datasets. HVG achieved the best clustering performance, indicating that foundation models do not necessarily provide superior cell embeddings without fine-tuning [15].
Batch Integration: scGPT showed mixed results, outperforming Harmony and scVI on complex datasets with both technical and biological batch effects, but underperforming on datasets with purely technical variation. Geneformer consistently ranked last in batch integration capabilities, with its embeddings often showing higher batch effect retention than the original data [15].
Pretraining Impact: Evaluation of different scGPT variants (random, kidney-specific, blood-specific, human) demonstrated that pretraining provides clear improvements over random initialization, but larger and more diverse pretraining datasets do not consistently confer additional benefits, suggesting diminishing returns to scale [15].
Table 4: Key Computational Tools and Resources for scFM Research
| Resource | Type | Primary Function | Relevance to Tokenization |
|---|---|---|---|
| Perturb-seq Datasets | Experimental Data | Provides ground truth for perturbation effects | Enables evaluation of tokenization strategies on functional outcomes |
| Gene Ontology (GO) | Biological Database | Structured knowledge base of gene functions | Provides biological priors for comparison with learned representations |
| CZ CELLxGENE | Data Platform | Standardized access to >100M single cells | Pretraining resource for developing tokenization approaches |
| HVG Selection | Computational Method | Identifies genes with high variability | Basis for scGPT's token reduction strategy |
| Random Forest Regression | Machine Learning Model | Baseline for prediction tasks | Tests biological relevance of gene embeddings independent of transformer architecture |
| Pearson Delta Metric | Evaluation Metric | Correlates predicted vs. actual differential expression | Quantifies performance of different tokenization schemes |
The benchmarking data reveals several critical considerations for researchers and drug development professionals:
Simplicity Versus Complexity: Simple baseline models consistently match or outperform sophisticated foundation models in perturbation prediction tasks. The "Train Mean" baseline surprisingly exceeded both scGPT and scFoundation performance, suggesting that current foundation models may not be capturing meaningful perturbation-specific signals beyond basic averaging approaches [5] [14].
Tokenization Impact: The choice of tokenization strategy significantly influences model performance. scGPT's focused approach using 1,200 highly variable genes demonstrates that careful gene selection may be more important than comprehensive gene inclusion, as implemented in scFoundation with 19,264 genes [5] [4].
Embedding Utility Versus Architecture: The strong performance of Random Forest models using scGPT's embeddings suggests that the pretrained gene representations capture biologically meaningful information, but this potential may not be fully leveraged within the transformer architecture itself [5].
Zero-Shot Limitations: Both scGPT and Geneformer show inconsistent zero-shot performance, indicating that their pretraining objectives may not optimally align with downstream biological tasks without fine-tuning [15].
For researchers selecting models for drug development applications, these findings suggest that foundation models should be evaluated against simple baselines specific to each use case. While scGPT and scFoundation represent significant engineering achievements, their practical advantage over simpler, more interpretable methods remains uncertain for critical applications like perturbation prediction. Future development should focus on better alignment between tokenization strategies, pretraining objectives, and biologically meaningful outcomes.
In the development of single-cell foundation models (scFMs) like scGPT and scFoundation, the choice of pretraining data is a fundamental determinant of model performance. These models are trained on vast collections of single-cell data to learn the "language of cells," with the goal of generating accurate predictions for downstream tasks, such as forecasting cellular responses to genetic perturbations [1]. However, recent rigorous benchmarks reveal a surprising trend: these complex models often fail to outperform simple baseline methods on key predictive tasks [18] [14]. This guide objectively compares the major data sources and examines the experimental evidence benchmarking the performance of models built upon them.
Single-cell foundation models rely on large-scale, curated data repositories for pretraining. The table below summarizes the key characteristics of the primary data sources available.
| Atlas Name | # Cells | Lead Organization | # Species | Primary URL |
|---|---|---|---|---|
| CZ CELLxGENE Discover | 112.8 M | Chan Zuckerberg Initiative (CZI) | 7 | https://cellxgene.cziscience.com/ |
| Human Cell Atlas (HCA) | 65.4 M | HCA Consortium | 1 | https://data.humancellatlas.org/ |
| DISCO | 125.6 M | Singapore Immunology Network | 1 | https://www.immunesinglecell.org |
| Single Cell Portal | 57.6 M | Broad Institute | 18 | https://singlecell.broadinstitute.org/ |
| Single Cell Expression Atlas | 13.5 M | EMBL-EBI | 21 | https://www.ebi.ac.uk/gxa/sc/home |
| Allen Brain Cell Atlas | 4.0 M | Allen Institute | 1 | https://portal.brain-map.org/ |
Source: adapted from [7]. Note: cell counts are approximate and current as of the time of writing.
Platforms like CZ CELLxGENE provide unified access to millions of annotated single-cell datasets, serving as a cornerstone for the scFM ecosystem [1] [19]. The Human Cell Atlas (HCA) is another monumental project that aggregates data from thousands of studies, regularly updating its portal with new and updated projects [20]. Public repositories such as the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA) host thousands of individual studies, which researchers then integrate into large training corpora [1].
Despite their sophisticated architecture and pretraining on massive datasets, both scGPT and scFoundation have been shown to underperform compared to simpler models in predicting post-perturbation gene expression. The following table summarizes key quantitative results from independent benchmarks.
| Model / Baseline | Performance Summary (Pearson Delta, Differential Expression) | Key Benchmarking Finding |
|---|---|---|
| scGPT | Adamson: 0.641, Norman: 0.554, Replogle K562: 0.327, Replogle RPE1: 0.596 [18] | Underperformed versus simple mean baseline and random forest models. |
| scFoundation | Adamson: 0.552, Norman: 0.459, Replogle K562: 0.269, Replogle RPE1: 0.471 [18] | Underperformed versus simple mean baseline and random forest models. |
| Train Mean (Baseline) | Adamson: 0.711, Norman: 0.557, Replogle K562: 0.373, Replogle RPE1: 0.628 [18] | The simplest baseline, which predicts the average expression from training data, outperformed both foundation models. |
| Random Forest + GO Features | Adamson: 0.739, Norman: 0.586, Replogle K562: 0.480, Replogle RPE1: 0.648 [18] | Outperformed foundation models by a large margin by using biologically meaningful features. |
| Additive Model (Baseline) | Outperformed all deep learning models in predicting double perturbation effects [14]. | A simple baseline that sums the effects of single perturbations was not beaten by any complex model. |
| Linear Model with Pretrained Embeddings | Performed as well as or better than scGPT and GEARS with their built-in decoders [14]. | Using embeddings from scFMs in a simple linear model was more effective than using the models' own complex architectures. |
A study published in Nature Methods (2025) reached a similar stark conclusion, finding that no deep-learning model could consistently outperform the simple mean prediction or a linear model in predicting the effects of unseen single-gene perturbations [14]. Furthermore, in predicting double-gene perturbations, even the simplistic "additive" baseline model, which sums the effects of two single perturbations, proved superior to all foundation models [14].
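The "linear model with pretrained embeddings" result can be illustrated with a synthetic sketch: a least-squares linear probe is fit from hypothetical frozen gene embeddings to expression changes, standing in for the simple linear decoders evaluated in [14]. All shapes and data below are illustrative, not taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen gene embeddings from a pretrained scFM:
# one d-dimensional vector per perturbed gene.
n_perts, d, n_genes = 40, 8, 50
pert_embeddings = rng.normal(size=(n_perts, d))

# Synthetic expression changes (delta vs. control) per perturbation,
# generated so that a linear readout of the embeddings can succeed.
true_w = rng.normal(size=(d, n_genes))
deltas = pert_embeddings @ true_w + 0.01 * rng.normal(size=(n_perts, n_genes))

# Fit a least-squares linear map from embedding space to
# expression-change space; evaluate on 10 held-out perturbations.
train_x, test_x = pert_embeddings[:30], pert_embeddings[30:]
train_y, test_y = deltas[:30], deltas[30:]
w, *_ = np.linalg.lstsq(train_x, train_y, rcond=None)

pred = test_x @ w
corr = np.corrcoef(pred.ravel(), test_y.ravel())[0, 1]
```

The point of the benchmark finding is that such a probe, fed with the foundation models' own embeddings, matched or beat the models' built-in decoders.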
To ensure fair comparisons, independent benchmarks have employed rigorous and consistent methodologies.
Diagram of the scFM pretraining and benchmarking workflow.
The following table details essential resources and their functions in this field, from data portals to computational tools.
| Resource Name | Type | Primary Function in Research |
|---|---|---|
| CELLxGENE Discover | Data Portal | Provides unified access to over 100 million curated single cells for discovery and analysis [1] [19]. |
| HCA Data Portal | Data Portal | Centralized platform to explore and download data from the Human Cell Atlas project [20]. |
| Perturb-seq | Experimental Technology | Combines CRISPR-based genetic perturbations with single-cell RNA sequencing to generate ground-truth data for benchmarking [18]. |
| Gene Ontology (GO) | Knowledge Base | Provides structured biological knowledge features (e.g., functional annotations) that can be used to build highly predictive baseline models [18]. |
| Random Forest Regressor | Computational Model | A classic machine learning algorithm that, when provided with GO features, has been shown to outperform complex foundation models [18]. |
| Linear Model with Embeddings | Computational Model | A simple model that uses pretrained gene embeddings from scFMs as input, often outperforming the original complex models [14]. |
In conclusion, while data sources like CELLxGENE and the Human Cell Atlas are invaluable for pretraining scFMs, current evidence indicates that the models themselves may not yet be leveraging this data effectively for perturbation prediction. Researchers should consider these benchmarking results and the power of simple, biologically-informed baselines when designing and evaluating their own studies.
The accurate prediction of drug response is a critical challenge in modern oncology, directly impacting the development of effective cancer therapies and the understanding of drug resistance mechanisms. Single-cell RNA sequencing (scRNA-seq) technology has emerged as a powerful tool for characterizing the cellular heterogeneity that underpins varying treatment outcomes [21]. Recently, large-scale foundation models pre-trained on massive biological datasets have shown potential for enhancing single-cell analysis. This guide provides an objective performance comparison of two prominent foundation models—scGPT and scFoundation—within the scDrugMap benchmarking framework, offering researchers evidence-based insights for model selection in drug response prediction tasks.
The scDrugMap framework represents the first comprehensive benchmark for evaluating large foundation models on drug response prediction using single-cell data. It incorporates a curated resource of over 326,000 cells from 36 datasets across 23 studies, spanning diverse cancer types, tissues, and treatment regimens [11] [22] [21]. The framework evaluates models under two distinct scenarios—pooled-data evaluation and cross-data evaluation—implementing both layer freezing and Low-Rank Adaptation (LoRA) fine-tuning strategies [21].
Table 1: Overall Performance Comparison of scGPT and scFoundation in scDrugMap Benchmark
| Model | Pooled-Data Evaluation (F1 Score) | Cross-Data Evaluation (F1 Score) | Key Strengths |
|---|---|---|---|
| scFoundation | 0.971 (layer freezing); 0.947 (fine-tuning) | Not reported | Excels in pooled-data scenarios with extensive training data |
| scGPT | Not best performer | 0.858 (zero-shot) | Superior cross-dataset generalization with zero-shot learning |
| UCE | Not best performer | 0.774 (fine-tuning on tumor tissue) | Strong performance after fine-tuning on specific tissue types |
Table 2: Detailed Performance Metrics Across Evaluation Settings
| Evaluation Scenario | Training Strategy | scFoundation Performance | scGPT Performance | Top Performing Model |
|---|---|---|---|---|
| Pooled-Data | Layer Freezing | 0.971 (F1) | Lower than scFoundation | scFoundation |
| Pooled-Data | LoRA Fine-tuning | 0.947 (F1) | Lower than scFoundation | scFoundation |
| Cross-Data | Zero-Shot Learning | Lower than scGPT | 0.858 (F1) | scGPT |
| Cross-Data | Fine-tuning | Not best performer | Lower than UCE | UCE (0.774 F1) |
In the pooled-data evaluation scenario, where models are trained and tested on aggregated data from multiple studies, scFoundation demonstrated superior performance compared to all other models, including scGPT [11] [21]. scFoundation achieved the highest mean F1 scores of 0.971 with layer freezing and 0.947 with fine-tuning, outperforming the lowest-performing model by over 50% [21]. This indicates that scFoundation excels in contexts where substantial training data from multiple sources is available, effectively leveraging its pre-training on large-scale single-cell data.
In cross-data evaluation, where models are tested independently on datasets from individual studies to assess generalization capabilities, scGPT demonstrated superior performance in zero-shot learning with a mean F1 score of 0.858 [21]. This highlights scGPT's stronger generalization to unseen data distributions without additional training. After fine-tuning on tumor tissue, UCE achieved the highest performance (mean F1: 0.774) in this setting [21], suggesting that model performance is highly dependent on both the base architecture and the adaptation strategy.
Independent benchmarking studies beyond scDrugMap have revealed important limitations in current foundation models for biological prediction tasks. Research published in Nature Methods found that neither scGPT nor scFoundation outperformed deliberately simple baselines for predicting genetic perturbation effects [14]. Simple models—including taking the mean of training examples or using basic machine learning models with biologically meaningful features—often outperformed these foundation models by a substantial margin [5] [14].
Similarly, zero-shot evaluations published in Genome Biology demonstrated that both scGPT and Geneformer underperform simpler methods like highly variable gene selection and established integration tools (Harmony, scVI) on tasks including cell type clustering and batch integration [16] [15]. These findings highlight that while foundation models show promise, their practical utility for drug response prediction requires careful validation against simpler alternatives.
The scDrugMap framework incorporates a primary collection of 326,751 single tumor cells from 36 scRNA-seq datasets across 23 studies, with representation of 14 cancer types, 3 therapy categories (targeted therapy, chemotherapy, immunotherapy), and multiple tissue types (cell lines, bone marrow aspirates, tumor tissue, PBMCs) [21]. An independent validation collection includes 18,856 cells from 17 datasets across 6 studies [21]. This comprehensive coverage ensures robust benchmarking across diverse biological contexts.
scDrugMap implements two primary adaptation strategies for foundation models: layer freezing, in which the pretrained weights are held fixed while a lightweight task head is trained, and Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method [21].
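Layer freezing, one of the adaptation strategies evaluated alongside LoRA, can be sketched in PyTorch. The backbone, head, and dimensions below are toy stand-ins for illustration, not scDrugMap's actual code:

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained transformer backbone plus a
# task-specific classification head (binary drug-response label).
backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
head = nn.Linear(64, 2)

# Layer freezing: keep backbone weights fixed, train only the head.
for p in backbone.parameters():
    p.requires_grad = False

trainable = [p for p in head.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

x = torch.randn(8, 64)
logits = head(backbone(x))
loss = nn.functional.cross_entropy(logits, torch.zeros(8, dtype=torch.long))
loss.backward()
optimizer.step()
# Backbone gradients stay None; only the head accumulates gradients.
```

LoRA takes the complementary approach: the pretrained weights also stay frozen, but small low-rank update matrices are injected into the attention layers and trained instead of a separate head alone.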
The benchmarking protocol employs the F1 score as the primary metric, providing a balanced measure of precision and recall for drug response prediction [21]. The evaluation follows rigorous data splitting strategies appropriate for each scenario, with cross-validation in pooled-data settings and leave-one-dataset-out validation in cross-data settings to ensure reliable performance estimation.
Table 3: Essential Research Resources for scDrugMap-Style Benchmarking
| Resource Category | Specific Examples | Function in Research |
|---|---|---|
| Single-Cell Foundation Models | scFoundation, scGPT, UCE, scBERT, Geneformer, cellLM, cellPLM | Base pre-trained models for transfer learning and zero-shot evaluation |
| Large Language Models | LLaMa3-8B, GPT4o-mini | General-purpose models adaptable for biological sequence analysis |
| Training Adaptation Methods | Low-Rank Adaptation (LoRA), Layer Freezing | Parameter-efficient fine-tuning strategies for model specialization |
| Computational Frameworks | scDrugMap (Python CLI & Web Server), BioLLM | Standardized interfaces for model integration and evaluation |
| Benchmark Datasets | Primary Collection (326,751 cells), Validation Collection (18,856 cells) | Curated single-cell data with drug response annotations for training and testing |
| Evaluation Metrics | F1 Score, Pearson Correlation, Differential Expression Analysis | Quantitative performance assessment for model comparison |
The scDrugMap benchmarking framework provides comprehensive evidence that both scFoundation and scGPT offer distinct strengths for drug response prediction, with the optimal choice dependent on the specific research context and application requirements. scFoundation demonstrates superior performance in pooled-data scenarios where substantial training data is available, while scGPT excels in cross-data evaluation with stronger zero-shot generalization capabilities. However, independent studies consistently show that simpler models can sometimes outperform these foundation models, highlighting the importance of rigorous benchmarking against appropriate baselines. Researchers should select models based on their specific use case, data availability, and generalization requirements, while remaining critical of model claims and validating performance against simpler alternatives.
The ability to accurately predict cellular responses to genetic perturbations is a cornerstone of functional genomics and therapeutic discovery. Technologies like Perturb-seq, which combines CRISPR-based interventions with single-cell RNA sequencing, have generated vast datasets detailing these responses. In response, the computational biology community has developed sophisticated "foundation" models, pre-trained on millions of single-cell transcriptomes, to tackle this prediction problem. Two prominent models, scGPT and scFoundation, have emerged as state-of-the-art candidates. However, rigorous and independent benchmarking is crucial to validate their performance claims. This guide synthesizes evidence from recent, comprehensive studies to objectively compare the predictive performance of these foundation models against each other and, importantly, against simpler baseline approaches. The overarching finding across multiple independent investigations is that despite their complexity and computational cost, these foundation models currently fail to consistently outperform deliberately simple baselines, highlighting significant challenges and opportunities for improvement in the field.
Recent benchmark studies have systematically evaluated scGPT and scFoundation against a range of simpler models on the task of predicting post-perturbation gene expression profiles. The consistent result is that foundation models are often outperformed by straightforward alternatives.
Table 1: Performance Comparison on Perturbation Prediction Tasks (Pearson Correlation in Differential Expression Space)
| Model / Dataset | Adamson et al. | Norman et al. | Replogle (K562) | Replogle (RPE1) |
|---|---|---|---|---|
| Train Mean (Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest (GO Features) | 0.739 | 0.586 | 0.480 | 0.648 |
Data Source: [5]
The data reveals that the simplest baseline, which predicts the average expression profile from the training data ("Train Mean"), reliably outperforms both scGPT and scFoundation across multiple datasets [5]. Even more notably, a standard Random Forest model using Gene Ontology (GO) biological pathway features as input "outperformed foundation models by a large margin" [5]. This suggests that incorporating structured biological prior knowledge can be more effective than the representations learned through the foundation models' pre-training on vast amounts of single-cell data.
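The GO-feature baseline can be approximated as follows: each perturbed gene is represented by a binary vector of GO-term memberships, and a Random Forest maps those features to expression changes. The annotation matrix and effects below are synthetic; the cited benchmark uses real Gene Ontology annotations [5]:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Hypothetical binary annotation matrix: one row per perturbed gene,
# one column per GO term (1 = the gene is annotated to that term).
n_perts, n_terms, n_genes = 60, 20, 30
go_features = rng.integers(0, 2, size=(n_perts, n_terms)).astype(float)

# Synthetic ground truth: perturbations that share GO terms produce
# similar downstream expression changes.
term_effects = rng.normal(size=(n_terms, n_genes))
deltas = go_features @ term_effects

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(go_features[:50], deltas[:50])    # train on 50 perturbations
pred = model.predict(go_features[50:])      # predict 10 held-out ones
corr = np.corrcoef(pred.ravel(), deltas[50:].ravel())[0, 1]
```

The design choice worth noting is that the features themselves carry the biological prior; the learner is deliberately generic.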
In a separate study published in Nature Methods, a simple additive model—which sums the individual logarithmic fold changes of single perturbations to predict the effect of a double perturbation—proved superior to five foundation models and two other deep learning approaches [14]. Furthermore, when tasked with predicting genetic interactions (where the effect of a double perturbation is non-additive), none of the deep learning models performed better than a "no change" baseline that always predicts the control condition [14].
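The additive baseline from [14] reduces to a one-line computation; a sketch with toy log fold changes:

```python
import numpy as np

def additive_prediction(lfc_a: np.ndarray, lfc_b: np.ndarray) -> np.ndarray:
    """Predict the double-perturbation log fold change as the sum of
    the two single-perturbation log fold changes."""
    return lfc_a + lfc_b

# Toy single-perturbation log fold changes over 4 genes.
lfc_geneA = np.array([0.5, -1.0, 0.0, 2.0])
lfc_geneB = np.array([0.5,  1.0, 0.3, -0.5])
double = additive_prediction(lfc_geneA, lfc_geneB)  # -> [1.0, 0.0, 0.3, 1.5]
```

By construction this baseline cannot capture genetic interactions, which is precisely why its superiority over the deep models is such a sobering result.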
Understanding the methodology behind these benchmarks is critical for interpreting the results and for researchers aiming to conduct their own evaluations.
The benchmarks rely on publicly available Perturb-seq datasets, which use CRISPR to perturb genes and single-cell RNA sequencing to measure the transcriptional outcome. Key datasets include the Adamson, Norman, and Replogle (K562 and RPE1) studies used throughout the benchmarks [5] [14].
For evaluation, single-cell predictions are typically aggregated by perturbation to create pseudo-bulk expression profiles, which are compared to the ground truth pseudo-bulk profiles [5].
The core evaluation metric is often the Pearson correlation, calculated in two key spaces: the raw post-perturbation expression space, and the differential expression ("delta") space, where predicted and observed changes relative to control are compared [5].
The primary task is Perturbation Exclusive (PEX) prediction, where the model's ability to generalize to the effects of completely unseen perturbations or combinations is tested [5] [23].
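A compact sketch of this evaluation pipeline — pseudo-bulk aggregation by perturbation, then Pearson correlation in both raw and differential ("delta") space — on toy data (labels, expression values, and the model prediction are all illustrative):

```python
import numpy as np
import pandas as pd

# Toy single-cell matrix: 6 cells x 3 genes with perturbation labels.
cells = pd.DataFrame(
    [[1.0, 2.0, 0.0],
     [3.0, 2.0, 2.0],
     [0.0, 1.0, 1.0],
     [2.0, 1.0, 3.0],
     [1.0, 1.0, 1.0],
     [1.0, 3.0, 1.0]],
    columns=["g1", "g2", "g3"],
)
cells["pert"] = ["KO_A", "KO_A", "KO_B", "KO_B", "ctrl", "ctrl"]

# Pseudo-bulk: average expression per perturbation.
pseudobulk = cells.groupby("pert").mean()
control = pseudobulk.loc["ctrl"].to_numpy()

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

truth = pseudobulk.loc["KO_A"].to_numpy()
pred = np.array([2.2, 1.8, 1.2])  # some model's prediction for KO_A

# Raw expression space vs. delta space relative to control --
# the latter is the "Pearson delta" metric used in the benchmarks.
r_raw = pearson(pred, truth)
r_delta = pearson(pred - control, truth - control)
```

Evaluating in delta space matters because raw expression profiles are dominated by genes that are highly expressed regardless of perturbation, which can inflate raw-space correlations.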
For the foundation models, the standard protocol involves taking a model that has been pre-trained on a large corpus of single-cell data (often >10 million cells) and then fine-tuning it on the specific perturbation dataset of interest. The baseline models, such as Random Forest or k-Nearest Neighbors, are trained directly on the perturbation data using features derived from biological databases or the foundation models' own gene embeddings [5].
The following diagram illustrates the standard workflow for training and evaluating perturbation response prediction models, as used in the cited benchmarks.
Diagram 1: Benchmarking Workflow for Perturbation Prediction Models. This workflow compares foundation models (fine-tuned on Perturb-seq data) against baseline models trained directly on the data with biological features. Performance is evaluated by comparing predictions to the experimental ground truth.
Successful perturbation modeling relies on a combination of computational tools and curated biological datasets. The table below details essential "research reagents" for this field.
Table 2: Essential Research Reagents for Perturbation Modeling
| Resource Name | Type | Primary Function in Perturbation Modeling |
|---|---|---|
| Perturb-seq Datasets (Adamson, Norman, Replogle) | Experimental Data | Provides the ground truth data of gene expression responses to genetic perturbations, used for model training and benchmarking [5] [14]. |
| Gene Ontology (GO) / KEGG / Reactome | Biological Database | Curated knowledge bases of biological pathways and functions. Used as informative features for baseline machine learning models [5]. |
| CollecTRI | Biological Database | A comprehensive gene regulatory network resource. Used to evaluate the biological meaningfulness of learned gene embeddings [5]. |
| PerturBench | Computational Framework | A modular codebase for standardized development and evaluation of perturbation prediction models, ensuring fair comparisons [23]. |
| BioLLM | Computational Framework | A unified interface that integrates diverse single-cell foundation models (scGPT, Geneformer, scFoundation), streamlining their application and evaluation [17]. |
The independent benchmarking of scGPT and scFoundation reveals a critical and consistent finding: as of early 2025, these complex foundation models do not surpass the performance of simple baseline methods for predicting cellular responses to genetic perturbations. Models that predict the average training response or use off-the-shelf biological features in a Random Forest regressor consistently set a high bar. This does not negate the potential of the foundation model approach but underscores that the field is still in its early stages. Future progress will likely depend on improved model architectures, more effective pre-training strategies, and the development of benchmarking standards that more accurately reflect the complex biological reality of perturbation responses. For researchers and drug developers, the current evidence strongly suggests that simpler, interpretable models should be included as robust baselines in any project aiming to predict genetic perturbation effects.
The advent of single-cell foundation models (scFMs) represents a paradigm shift in computational biology, offering unprecedented opportunities to advance cancer drug response prediction (DRP). Models like scGPT and scFoundation, built on transformer architectures pretrained on millions of single cells, promise to capture universal biological principles that can be specialized for downstream tasks like DRP [1]. These models employ sophisticated tokenization strategies where genes become input tokens analogous to words in a sentence, with expression values providing additional context [4]. The fundamental premise is that exposure to diverse cellular states across tissues and conditions enables these models to learn generalized representations of cellular behavior that can enhance predictive accuracy for specific applications like DeepCDR.
However, integrating these powerful models into existing DRP pipelines requires careful benchmarking to identify their relative strengths, limitations, and optimal implementation strategies. Recent comprehensive evaluations reveal a complex performance landscape where scFMs demonstrate significant potential but also notable limitations compared to simpler approaches [5] [14] [4]. This comparison guide provides an objective assessment of scGPT versus scFoundation performance to inform effective integration with DeepCDR frameworks, supported by experimental data and implementation protocols.
Accurately predicting cellular responses to genetic and chemical perturbations is fundamental to DRP. Benchmarking studies directly compared scGPT and scFoundation against baseline models for predicting transcriptome changes after single and double genetic perturbations using Perturb-seq datasets (Adamson, Norman, and Replogle) [5] [14].
Table 1: Performance Comparison in Perturbation Prediction (Pearson Delta Metric)
| Model | Adamson Dataset | Norman Dataset | Replogle K562 | Replogle RPE1 |
|---|---|---|---|---|
| Train Mean (Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest with GO | 0.739 | 0.586 | 0.480 | 0.648 |
Surprisingly, even simple baseline models like Train Mean (predicting the average of training examples) outperformed both foundation models across all datasets [5]. Similarly, a linear baseline model that sums individual logarithmic fold changes for double perturbations substantially outperformed scGPT, scFoundation, and other deep learning models [14]. Random Forest models incorporating biological prior knowledge (Gene Ontology features) achieved the best performance, surpassing scGPT by a large margin [5].
For practical implementation in DRP pipelines, zero-shot performance (without task-specific fine-tuning) is crucial for exploratory applications where labeled data is limited. Evaluation of zero-shot capabilities for cell type annotation and batch integration revealed important limitations:
Table 2: Zero-Shot Performance Across Biological Tasks
| Model | Cell Type Annotation (AvgBIO) | Batch Integration | Biological Relevance |
|---|---|---|---|
| scGPT | Inconsistent; outperformed by scVI and Harmony on most datasets | Moderate technical batch correction; struggles with biological variation | Captures some biological pathways |
| Geneformer | Consistently outperformed by simple HVG selection | Poor performance; embeddings often dominated by batch effects | Limited biological relevance in embeddings |
| scFoundation | Not extensively evaluated in zero-shot | Not extensively evaluated in zero-shot | Gene embeddings show biological utility |
In zero-shot cell type clustering, both scGPT and Geneformer were consistently outperformed by established methods like Harmony, scVI, and even simple highly variable genes (HVG) selection [15]. Notably, selecting HVGs achieved the best batch integration scores across all datasets, highlighting the performance gap for foundation models in zero-shot settings [15].
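The HVG baseline that tops these comparisons is itself very simple. Below is a bare-bones numpy stand-in for scanpy-style dispersion-based selection; the synthetic data, dispersion statistic, and gene count are illustrative, not the exact procedure from the cited study:

```python
import numpy as np

def select_hvgs(expr: np.ndarray, n_top: int) -> np.ndarray:
    """Rank genes by dispersion (variance / mean) and return the
    column indices of the top `n_top` genes -- a minimal stand-in
    for scanpy-style highly variable gene selection."""
    mean = expr.mean(axis=0)
    var = expr.var(axis=0)
    dispersion = np.where(mean > 0, var / np.maximum(mean, 1e-12), 0.0)
    return np.argsort(dispersion)[::-1][:n_top]

rng = np.random.default_rng(2)
# 100 cells x 50 genes; genes 0-4 get extra (simulated) biological
# variability on top of the Poisson-like technical noise.
expr = rng.poisson(2.0, size=(100, 50)).astype(float)
expr[:, :5] *= rng.gamma(2.0, 1.0, size=(100, 1))

top10 = select_hvgs(expr, n_top=10)
# The artificially variable genes should dominate the top ranks.
```

Downstream, clustering on only the selected genes is what the benchmark compared against the foundation models' embeddings.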
To ensure reproducible evaluation of scFMs for DRP applications, researchers should implement standardized benchmarking protocols mirroring recent comprehensive studies:
Data Preparation and Partitioning: Hold out entire perturbations (or perturbation combinations) so that test perturbations are unseen during training, and aggregate single-cell predictions into pseudo-bulk profiles per perturbation for evaluation [5].
Evaluation Metrics: Report Pearson correlation in both raw expression space and differential expression space (Pearson delta), with the latter emphasized because it isolates perturbation-specific change [5].
Baseline Models: Always include the train-mean baseline, the additive model for combinatorial perturbations, and a Random Forest using Gene Ontology features [5] [14].
Feature Extraction Approach: Rather than using scFMs as end-to-end predictors, extract gene and cell embeddings as features for traditional machine learning models. Random Forest models using scGPT embeddings achieved better performance than fine-tuned scGPT itself, though still underperforming compared to biological prior knowledge features [5].
Hybrid Prediction Framework: Implement ensemble approaches combining scFM embeddings with biological knowledge features. Studies show that incorporating Gene Ontology vectors and pathway information significantly boosts prediction accuracy compared to using foundation model outputs alone [5] [4].
Diagram 1: Enhanced DeepCDR Integration Framework. This workflow combines foundation model embeddings with traditional machine learning and biological prior knowledge for improved drug response prediction.
Understanding the fundamental architectural differences between scGPT and scFoundation is essential for effective integration:
Table 3: Model Architectures and Training Specifications
| Parameter | scGPT | scFoundation |
|---|---|---|
| Architecture | GPT-style decoder with unidirectional attention | BERT-style encoder with bidirectional attention |
| Parameters | ~50 million | ~100 million |
| Pretraining Data | 33 million non-cancerous human cells | 50 million single cells |
| Input Genes | 1,200 highly variable genes | 19,264 protein-encoding genes |
| Tokenization | Value binning combined with gene embeddings | Gene embeddings with value projection |
| Positional Encoding | Not used | Not used |
| Primary Pretraining Task | Iterative masked gene modeling with MSE loss | Read-depth-aware masked gene modeling |
scGPT employs a GPT-style decoder architecture pretrained on 33 million non-cancerous human cells, using value binning for expression levels and focusing on highly variable genes [4] [1]. In contrast, scFoundation utilizes a BERT-style encoder trained on 50 million cells with nearly complete gene coverage, implementing read-depth-aware masking during pretraining [4].
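scGPT's value-binning tokenization can be sketched as follows. Note that the real model derives bin edges from the data itself; the fixed edges and bin count below are purely for clarity:

```python
import numpy as np

def bin_expression(values: np.ndarray, edges) -> np.ndarray:
    """Map continuous expression values to discrete bin indices,
    which serve as value tokens alongside each gene's identity token.
    Values below the first edge (including zeros) map to token 0."""
    return np.digitize(values, edges)

# One toy cell over six genes; the bin edges are illustrative.
cell = np.array([0.0, 0.1, 1.0, 2.5, 7.0, 0.0])
gene_ids = np.arange(cell.size)                  # gene-identity tokens
value_tokens = bin_expression(cell, edges=[0.5, 2.0, 5.0])
# value_tokens -> [0, 0, 1, 2, 3, 0]
```

Each input position thus carries a (gene, value-bin) token pair, which is how continuous expression enters a discrete-token transformer; scFoundation instead projects continuous values directly into the embedding space.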
Table 4: Essential Research Resources for scFM Integration
| Resource | Type | Function in DRP Research |
|---|---|---|
| GDSC Database | Drug screening dataset | Primary source of drug response data (IC50 values) for model training and validation |
| CCLE | Cell line database | Provides multi-omics profiles of cancer cell lines for feature generation |
| Perturb-seq Datasets | Genetic perturbation data | Enables model benchmarking for perturbation response prediction |
| PubChem | Chemical database | Source of drug molecular representations (fingerprints, SMILES strings) |
| Gene Ontology | Biological knowledge base | Provides prior knowledge features for enhancing prediction accuracy |
| BioLLM Framework | Software framework | Standardized APIs for integrating and evaluating multiple scFMs |
Based on comprehensive benchmarking evidence, the following integration approaches are recommended:
Prioritize Feature Extraction over End-to-End Learning: Instead of using scFMs as complete DRP solutions, extract their gene and cell embeddings as input features for established DeepCDR architectures. Experimental results demonstrate that Random Forest models using scGPT embeddings outperformed fine-tuned scGPT while maintaining computational efficiency [5].
Implement Ensemble Strategies: Combine foundation model outputs with biological prior knowledge. Studies consistently show that models incorporating Gene Ontology features and pathway information achieve superior performance compared to standalone scFM predictions [5] [25].
Leverage scGPT for Blood-Derived Cancers: scGPT demonstrates stronger performance on blood and bone marrow datasets compared to other tissue types, suggesting prioritized integration for hematological malignancies [15].
Utilize scFoundation for Comprehensive Gene Coverage: When full transcriptome analysis is required, scFoundation's coverage of 19,264 protein-encoding genes provides advantages over scGPT's highly-variable-gene approach [4].
Despite their theoretical promise, current scFMs show consistent limitations that warrant consideration:
Simplicity-Performance Paradox: Across multiple benchmarks, simple baseline models consistently matched or outperformed sophisticated foundation models. The "additive model" for genetic interactions and "train mean" for perturbation response provided competitive baselines [5] [14].
Specialized DRP Model Superiority: Models specifically designed for drug response prediction, such as GraphTCDR (utilizing heterogeneous graph neural networks) and SubCDR (employing subcomponent-guided deep learning), demonstrated superior performance compared to general-purpose scFMs [25] [26].
Computational Efficiency Trade-offs: The substantial computational resources required for scFM fine-tuning may not be justified given their current performance limitations, especially when simpler models achieve comparable or better results [14].
Diagram 2: Model Selection Framework for DRP. This decision flow prioritizes simpler, biologically-informed approaches based on benchmarking evidence, with foundation models reserved for specialized cases.
Integration of single-cell foundation models with DeepCDR frameworks offers promising avenues for enhancing cancer drug response prediction, but requires careful, evidence-based implementation. Current benchmarking reveals that while scGPT and scFoundation provide valuable biological representations, they rarely outperform simpler approaches as end-to-end solutions and show significant limitations in zero-shot settings.
For immediate DeepCDR enhancement, a hybrid approach leveraging scGPT embeddings as input features to traditional machine learning models, augmented with biological prior knowledge, represents the most promising integration path. This strategy combines the representation learning capabilities of foundation models with the proven predictive power of established DRP methodologies. As the scFM field rapidly evolves, continued rigorous benchmarking against simple baselines remains essential to distinguish genuine algorithmic advances from incremental improvements that fail to translate to practical predictive performance.
Within the rapidly evolving field of single-cell biology, foundation models pretrained on millions of cells promise to serve as versatile tools for a wide array of downstream tasks. The "pre-train then fine-tune" paradigm aims to capture universal patterns of gene regulation and cell behavior, which can then be efficiently adapted to specific applications. This guide provides an objective, data-driven comparison of two prominent foundation models—scGPT and scFoundation—focusing on their performance across three critical tasks: cell type annotation, batch correction, and gene network inference. The analysis is framed within the broader context of benchmarking studies that seek to evaluate whether these complex models deliver tangible advantages over simpler, more established computational methods. The findings summarized here are based on recent peer-reviewed literature and preprints that have conducted rigorous, multi-faceted benchmarks.
The following tables summarize the quantitative performance of scGPT and scFoundation against various baseline models across different tasks. The data is aggregated from multiple large-scale benchmarking studies.
Table 1: Performance on Perturbation Effect Prediction (Differential Expression Space)
| Model | Adamson Dataset (Pearson Delta) | Norman Dataset (Pearson Delta) | Replogle K562 (Pearson Delta) | Replogle RPE1 (Pearson Delta) |
|---|---|---|---|---|
| Train Mean (Simplest Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest with GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
Table 2: Zero-Shot Performance on Cell-Level Tasks [4] [15]
| Model | Cell Type Annotation (AvgBio Score) | Batch Integration (iLISI Score) | Computational Resources |
|---|---|---|---|
| scGPT | Variable; outperformed by scVI and Harmony on most datasets [15] | Good on complex datasets with biological batch effects [15] | 50 M parameters [4] |
| scFoundation | Information missing | Information missing | 100 M parameters [4] |
| Geneformer | Consistently outperformed by simpler baselines [15] | Poor; often worsened batch effects [15] | 40 M parameters [4] |
| Baseline: scVI / Harmony | Consistently high performance [15] | Consistently high performance [15] | Lower resource requirements |
Table 3: Performance on Gene Network Inference [27]
| Model Category | Representative Methods | Precision | Recall | Leverages Interventional Data |
|---|---|---|---|---|
| Observational Methods | PC, GES, NOTEARS | Low to Moderate | Low to Moderate | No |
| Interventional Methods | GIES, DCDI | Low to Moderate | Low to Moderate | Yes, but with limited benefit |
| Challenge-Winning Methods | Mean Difference, Guanlab | High | High | Yes |
| Foundation Models | scGPT, scFoundation | Not consistently top-ranked [27] | Not consistently top-ranked [27] | Information missing |
To ensure the reproducibility of the results and a fair understanding of the comparisons, this section outlines the key experimental methodologies shared across the cited benchmarking studies.
1. Datasets: Benchmarks primarily used Perturb-seq datasets, including the Adamson (CRISPRi, single perturbations), Norman (CRISPRa, single and combinatorial perturbations), and genome-wide Replogle K562 and RPE1 (CRISPRi) studies [5] [14].
2. Task Formulation: The core task was framed as a Perturbation Exclusive (PEX) prediction. Models were trained on a set of perturbations and then tested on their ability to predict the gene expression profile of held-out, unseen perturbations [5].
3. Evaluation Metrics: Single-cell predictions were aggregated into pseudo-bulk profiles, and Pearson correlation was computed in differential expression space (perturbed minus control) to isolate perturbation-specific effects [5].
4. Baseline Models: Deliberately simple references included the Train Mean predictor, the additive model for combinatorial perturbations, and random forest regressors using Gene Ontology features or pretrained gene embeddings [5] [14].
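The Perturbation Exclusive setup above can be sketched as a split over perturbation labels rather than over individual cells; `pex_split` and the gene names below are illustrative, not taken from any benchmark codebase:

```python
import numpy as np

def pex_split(perturbation_labels, test_fraction=0.25, seed=0):
    """Perturbation Exclusive (PEX) split: hold out whole perturbations,
    so every test-set perturbation is unseen during training."""
    rng = np.random.default_rng(seed)
    perts = np.unique(perturbation_labels)
    rng.shuffle(perts)
    n_test = max(1, int(len(perts) * test_fraction))
    test_perts = set(perts[:n_test])
    test_mask = np.array([p in test_perts for p in perturbation_labels])
    return ~test_mask, test_mask  # boolean masks over cells

# Example: six cells carrying three distinct perturbations.
labels = np.array(["KLF1", "KLF1", "GATA1", "GATA1", "TP53", "TP53"])
train_mask, test_mask = pex_split(labels)
# No perturbation appears on both sides of the split.
assert set(labels[train_mask]).isdisjoint(set(labels[test_mask]))
```

Splitting at the perturbation level is what forces the model to generalize to genuinely novel interventions rather than memorizing per-perturbation responses.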
1. Feature Extraction: Models were evaluated in a zero-shot setting. This means their pretrained weights were frozen, and cell (or gene) embeddings were extracted without any further task-specific fine-tuning. This tests the generalizable biological knowledge acquired during pretraining [4] [15].
2. Downstream Tasks & Evaluation: Frozen embeddings were scored on cell type annotation and clustering (e.g., AvgBio and silhouette metrics) and on batch integration (e.g., iLISI), with scVI and Harmony serving as established baselines [15].
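A minimal sketch of this zero-shot protocol, using synthetic stand-ins for embeddings extracted from a frozen model; `zero_shot_eval` is an illustrative name, and the exact metric implementations in the cited benchmarks may differ:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

def zero_shot_eval(cell_embeddings, cell_type_labels, n_clusters):
    """Score frozen (zero-shot) embeddings: cluster them, then measure
    agreement with known cell types (ARI) and separation (silhouette)."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=0).fit_predict(cell_embeddings)
    return {
        "ari": adjusted_rand_score(cell_type_labels, clusters),
        "silhouette": silhouette_score(cell_embeddings, cell_type_labels),
    }

# Stand-in for embeddings extracted from a frozen model: two well-separated
# cell types in a 16-dimensional space.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (50, 16)), rng.normal(3, 0.1, (50, 16))])
labels = np.array([0] * 50 + [1] * 50)
scores = zero_shot_eval(emb, labels, n_clusters=2)
assert scores["ari"] > 0.9  # clean separation recovers the true types
```

Because no weights are updated, any structure the metrics detect must already be present in the pretrained representation.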
1. Benchmark Suite: Evaluations were conducted using CausalBench, a suite designed for network inference on real-world, large-scale single-cell perturbation data [27].
2. Ground Truth Challenge: Since the true causal graph is unknown in biological systems, CausalBench uses biologically-motivated metrics and distribution-based interventional measures instead of a known graph [27].
3. Evaluation Metrics: Methods were scored using biologically-motivated precision and recall, together with distribution-based interventional measures, in place of comparison against a known ground-truth graph [27].
The following diagrams illustrate the core architectures of the foundation models and the workflow for a typical benchmarking study.
Diagram 1: Model Architectures Comparison.
Diagram 2: Benchmarking Workflow.
Table 4: Key Computational Tools and Datasets for Benchmarking
| Tool / Dataset Name | Type | Primary Function in Benchmarking | Key Features / Notes |
|---|---|---|---|
| Perturb-seq Datasets (Adamson, Norman, Replogle) | Biological Dataset | Provides ground-truth data for training and evaluating perturbation prediction models. | Combines CRISPR perturbations with single-cell RNA-seq; enables causal inference [5] [14]. |
| CausalBench | Software Benchmark Suite | Evaluates gene network inference methods on real-world interventional data. | Provides biologically-motivated metrics in the absence of a known ground-truth graph [27]. |
| Gene Ontology (GO) | Knowledge Database | Provides biological prior knowledge features for baseline machine learning models. | A graph-based ontology of biological terms; used as feature vectors for genes [5]. |
| Harmony | Computational Algorithm | A leading baseline method for batch integration of single-cell data. | Used as a performance benchmark for foundation models' batch correction capabilities [15]. |
| scVI | Computational Algorithm | A generative deep learning model for single-cell data analysis. | Used as a performance benchmark for tasks like cell type annotation and batch integration [15]. |
The emergence of single-cell foundation models (scFMs) like scGPT and scFoundation represents a transformative development in computational biology, promising to leverage patterns learned from millions of cells to predict cellular behavior. These models are designed to capture universal principles of gene regulation and cell state dynamics, with the ultimate goal of accurately predicting cellular responses to genetic and chemical perturbations—a capability with profound implications for drug discovery and therapeutic development. However, rigorous independent benchmarking has revealed a surprising paradox: these sophisticated models frequently fail to outperform deliberately simple baseline methods in critical zero-shot learning scenarios, where models must make predictions without task-specific fine-tuning.
This performance gap exposes fundamental challenges in current approaches to model development and evaluation within the single-cell domain. Understanding the limitations of these models is not merely an academic exercise but a practical necessity for researchers and drug development professionals who rely on computational predictions to guide experimental design and resource allocation. This guide provides an objective comparison of scGPT and scFoundation against simpler alternatives, presenting comprehensive experimental data and methodologies to inform selection criteria for perturbation prediction tasks.
Zero-shot evaluation assesses models on tasks they haven't been specifically fine-tuned for, testing their ability to generalize beyond their original training objectives [16]. This approach is particularly valuable for assessing foundation models because it mirrors real-world discovery settings where labeled data for specific perturbations may be unavailable [28]. Proper benchmark design must account for multiple generalization scenarios, including Perturbation Exclusive (PEX) settings where models predict effects of entirely novel perturbations, and Cell Exclusive (CEX) settings where models generalize to unseen cell types or contexts [5].
The most informative benchmarks incorporate multiple datasets with varying technical and biological complexities. For single-cell perturbation prediction, ideal benchmarks should include datasets generated using different experimental techniques (e.g., CRISPRi, CRISPRa), across multiple cell lines, and with both single and combinatorial perturbations [5] [14]. This diversity helps distinguish models that have learned fundamental biological principles from those that have merely memorized dataset-specific correlations.
Table 1: Core Evaluation Methodologies for Perturbation Prediction
| Method Category | Representative Examples | Key Characteristics | Primary Applications |
|---|---|---|---|
| Foundation Models | scGPT, scFoundation, Geneformer | Transformer architectures pre-trained on millions of cells; require fine-tuning or used zero-shot | Cell type annotation, perturbation response prediction, batch integration |
| Baseline Models | Train Mean, Additive Model | Predict average of training samples or sum of individual effects | Simple benchmarks for model performance |
| Traditional ML | Random Forest, k-NN, ElasticNet | Use biological features (GO terms, embeddings); limited parameters | Perturbation prediction with biological priors |
| Linear Models | Matrix factorization approaches | Learn low-dimensional representations of genes and perturbations | Predicting effects of unseen perturbations |
Independent evaluations have employed several consistent methodological approaches across studies. For perturbation prediction, models typically receive as input gene expression vectors from unperturbed cells along with a representation of the perturbation, then generate predicted post-perturbation expression profiles [5]. Predictions are made at the single-cell level but are often aggregated to pseudo-bulk profiles for evaluation stability.
The most critical evaluation metric involves calculating Pearson correlations in the differential expression space (perturbed minus control expression) rather than raw expression space, as the latter tends to be dominated by baseline expression levels of highly expressed genes [5] [14]. Additional evaluation dimensions include performance on top differentially expressed genes, genetic interaction prediction (for combinatorial perturbations), and generalization to unseen cell types or conditions.
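The delta-space metric can be sketched as follows; `pearson_delta` is an illustrative name, and pseudo-bulk aggregation by a simple per-gene mean is an assumption about the cited protocol:

```python
import numpy as np

def pearson_delta(pred_cells, true_cells, control_cells):
    """Pearson correlation in differential-expression space: pseudo-bulk
    each condition (mean over cells), subtract the control pseudo-bulk,
    then correlate the predicted and observed deltas."""
    control = control_cells.mean(axis=0)
    pred_delta = pred_cells.mean(axis=0) - control
    true_delta = true_cells.mean(axis=0) - control
    return np.corrcoef(pred_delta, true_delta)[0, 1]

# Toy check: a prediction that exactly recovers the true shift scores ~1.
rng = np.random.default_rng(1)
control = rng.normal(5.0, 1.0, size=(200, 100))   # 200 cells x 100 genes
shift = rng.normal(0.0, 1.0, size=100)            # true perturbation effect
perturbed = control + shift
assert pearson_delta(perturbed, perturbed, control) > 0.99
```

Subtracting the control pseudo-bulk removes the dominant baseline-expression component, so the score reflects the perturbation effect rather than which genes happen to be highly expressed.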
Figure 1: Experimental workflow for benchmarking perturbation prediction models, showing input data types, model approaches, and evaluation strategies.
Independent benchmarking studies have consistently demonstrated that simpler approaches frequently match or exceed the performance of foundation models in predicting transcriptional responses to genetic perturbations. In one comprehensive evaluation, the simplest baseline—predicting the mean of training examples (Train Mean)—outperformed both scGPT and scFoundation across four different Perturb-seq datasets when measuring Pearson correlation in differential expression space [5]. More notably, random forest regressors using Gene Ontology (GO) vectors as features substantially outperformed foundation models by a large margin (Pearson Delta: 0.739 vs. 0.641 for scGPT on the Adamson dataset) [5].
Table 2: Performance Comparison Across Perturbation Datasets (Pearson Delta Metric)
| Model | Adamson Dataset | Norman Dataset | Replogle K562 | Replogle RPE1 |
|---|---|---|---|---|
| Train Mean | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| RF + GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| RF + scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |
For the challenging task of predicting double perturbation effects, foundation models have shown particular limitations. In evaluations using the Norman dataset (combinatorial CRISPRa perturbations), even the "no change" baseline—which always predicts expression identical to control conditions—outperformed specialized deep learning models including GEARS, scGPT, and scFoundation [14]. Furthermore, foundation models demonstrated poor capability in identifying genetic interactions, with most models predominantly predicting buffering interactions and rarely correctly predicting synergistic effects [14].
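The additive baseline for double perturbations, together with one simple (illustrative, not the benchmark's exact) way to quantify deviation from additivity, can be sketched as:

```python
import numpy as np

def additive_prediction(delta_a, delta_b):
    """Additive baseline for a double perturbation: the predicted
    differential expression is the sum of the single-perturbation deltas."""
    return delta_a + delta_b

def interaction_magnitude(delta_ab, delta_a, delta_b):
    """Relative deviation of the observed double-perturbation delta from
    additivity. Values near 0 suggest no genetic interaction; buffering vs.
    synergy is judged against the additive expectation."""
    expected = additive_prediction(delta_a, delta_b)
    return np.linalg.norm(delta_ab - expected) / np.linalg.norm(expected)

# Toy example: a perfectly additive gene pair shows zero deviation.
delta_a = np.array([1.0, 0.0, -0.5])
delta_b = np.array([0.5, -1.0, 0.0])
assert interaction_magnitude(delta_a + delta_b, delta_a, delta_b) == 0.0
```

A model that only ever predicts near-additive combinations will, by construction, classify most pairs as non-interacting or buffering, which is consistent with the failure mode reported above.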
Beyond perturbation prediction, zero-shot performance is crucial for exploratory tasks like cell type identification and batch integration, where predefined labels may be unavailable. Evaluations across multiple datasets reveal that both scGPT and Geneformer underperform established baselines in these settings [16]. In cell type clustering, selecting highly variable genes (HVG) consistently outperformed both foundation models across average BIO score and average silhouette width metrics [16]. Similarly, for batch integration tasks, foundation models generally failed to adequately correct for technical batch effects while preserving biological signal, with Harmony and scVI demonstrating superior performance [16].
The performance gap in batch integration is particularly striking. Visualization of embeddings from the Pancreas benchmark dataset (containing data from five different sources) revealed that while Geneformer and scGPT could integrate different experiments using the same technique, they generally failed to correct for batch effects between different techniques [16]. In these visualizations, the primary structure in foundation model embedding spaces was driven by batch effects rather than biologically meaningful categories.
The consistent underperformance of foundation models relative to simpler alternatives suggests systemic limitations in current approaches rather than implementation-specific issues. Several interconnected factors contribute to this performance gap:
Pretraining-finetuning mismatch: The masked language modeling objective used during pretraining may not optimally prepare models for perturbation prediction tasks [16]. While this approach effectively teaches models gene-gene correlations in baseline states, it provides limited guidance for predicting dynamic responses to interventions.
Low perturbation-specific variance: Commonly used benchmark datasets exhibit limited perturbation-specific signal relative to technical and biological noise [5]. This low signal-to-noise ratio makes it difficult for complex models to distinguish meaningful patterns from stochastic variation.
Inefficient knowledge extraction: Despite extensive pretraining, foundation models may not effectively distill biologically meaningful representations. This is evidenced by the superior performance of random forest models using foundation model embeddings compared to the end-to-end fine-tuned models themselves [5].
Figure 2: Key limitations of current single-cell foundation models that contribute to their underperformance relative to simpler approaches.
Recent research has begun addressing these limitations through innovative architectural and methodological improvements:
Efficient fine-tuning techniques: Approaches like the single-cell Drug-Conditional Adapter (scDCA) enable parameter-efficient fine-tuning by training less than 1% of original foundation model parameters while incorporating information from novel modalities (e.g., chemical structures) [29]. This preserves rich biological representations learned during pretraining while adapting to specific prediction tasks.
Enhanced benchmarking frameworks: Unified evaluation frameworks like BioLLM provide standardized APIs for consistent model comparison across diverse tasks, revealing distinct performance trade-offs across different scFM architectures [17]. Such frameworks enable more rigorous and reproducible model assessment.
Knowledge-enhanced representations: Incorporating structured biological knowledge through knowledge graphs has shown promise in other zero-shot learning domains [30], suggesting potential pathways for improving single-cell foundation models through explicit integration of pathway and regulatory network information.
Table 3: Essential Computational Tools for Perturbation Modeling
| Tool | Type | Primary Function | Considerations |
|---|---|---|---|
| scGPT | Foundation Model | Multi-task single-cell analysis | Strong overall performer; benefits from extensive pretraining |
| scFoundation | Foundation Model | Genetic perturbation prediction | Requires specific gene sets; limited flexibility |
| Geneformer | Foundation Model | Context-aware predictions | Limited zero-shot capabilities |
| Harmony | Batch Integration | Multi-dataset integration | Superior technical effect correction |
| scVI | Probabilistic Model | Dimensionality reduction, integration | Effective biological preservation |
| GEARS | Specialized Model | Genetic perturbation prediction | Utilizes prior knowledge of gene interactions |
For researchers seeking to implement perturbation prediction capabilities, evidence suggests the following strategic approaches:
For predicting novel genetic perturbations: Begin with random forest models using Gene Ontology features or pre-trained gene embeddings, as these consistently outperform more complex alternatives while offering greater computational efficiency and interpretability [5].
For zero-shot cell type identification: Prioritize established methods like Harmony or scVI over foundation models, as both demonstrate superior batch correction and cell type separation without requiring fine-tuning [16].
For predicting chemical perturbation effects: Consider efficient fine-tuning approaches like scDCA when foundation models must be employed, as these preserve pretrained knowledge while adapting to novel modalities with minimal parameter updates [29].
For benchmarking new models: Implement simple baselines (Train Mean, additive models) as essential reference points, as these provide critical context for evaluating whether model complexity translates to meaningful performance improvements [5] [14].
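The recommended baselines can be sketched as follows; the GO feature matrix here is synthetic, standing in for binary term-membership vectors of each perturbed gene, and `TrainMeanBaseline` is an illustrative name:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class TrainMeanBaseline:
    """Predicts, for every held-out perturbation, the mean pseudo-bulk
    profile of the training perturbations."""
    def fit(self, pert_features, pert_profiles):
        self.mean_ = np.asarray(pert_profiles).mean(axis=0)
        return self
    def predict(self, pert_features):
        return np.tile(self.mean_, (len(pert_features), 1))

# RF + GO baseline: one row per perturbation, features are (hypothetical)
# binary Gene Ontology membership vectors for the perturbed gene, targets
# are pseudo-bulk differential expression profiles.
rng = np.random.default_rng(0)
go_features = rng.integers(0, 2, size=(40, 30))      # 40 perts x 30 GO terms
profiles = go_features @ rng.normal(size=(30, 100))  # 40 perts x 100 genes
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(go_features, profiles)
mean_pred = TrainMeanBaseline().fit(go_features, profiles).predict(go_features[:5])
assert mean_pred.shape == (5, 100)
```

Both baselines train in seconds on a laptop, which is exactly why they make useful reference points when judging whether a foundation model's extra complexity pays off.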
The consistent pattern of simple baselines matching or exceeding foundation model performance in zero-shot perturbation prediction represents both a challenge and opportunity for the single-cell biology community. Rather than invalidating the foundation model approach, these findings highlight the immaturity of current methodologies and the need for more biologically-grounded architectures, improved pretraining strategies, and more rigorous evaluation practices.
The most promising research directions include developing pretraining objectives that better capture causal relationships rather than correlations, incorporating explicit biological knowledge through structured data sources, and creating more challenging benchmarks with higher perturbation-specific signal. Additionally, parameter-efficient fine-tuning techniques that preserve foundational knowledge while adapting to specific tasks represent a practical path forward for applying these models to real-world discovery settings.
For researchers and drug development professionals, the current evidence suggests a cautious approach to adopting foundation models for critical perturbation prediction tasks. While their theoretical potential remains substantial, practical implementations should prioritize robust benchmarking against simple alternatives and selective application to tasks where they demonstrate clear, measurable advantages over more straightforward approaches.
The development of single-cell foundation models (scFMs) like scGPT and scFoundation represents a transformative advance in computational biology, promising to predict cellular responses to genetic and chemical perturbations with high accuracy [1]. These transformer-based models are pre-trained on millions of single-cell transcriptomes to learn fundamental principles of gene regulation and signaling, then fine-tuned for specific prediction tasks [12]. However, rigorous benchmarking studies have revealed surprising limitations in current evaluation paradigms, particularly stemming from low perturbation-specific variance in commonly used benchmark datasets [5]. This challenge fundamentally undermines our ability to accurately assess model performance and compare competing approaches.
The core issue identified in recent research is that standard perturbation datasets exhibit minimal variance that can be specifically attributed to the perturbations themselves, as opposed to general biological variation or technical noise [5]. When the signal of interest is weak relative to background variation, even simple baseline models can appear to perform comparably to sophisticated foundation models, making meaningful comparison difficult. This problem is compounded by the predominance of perturbation-exclusive (PEX) benchmarking setups that test only a model's ability to generalize to novel perturbations in familiar cell types, rather than evaluating performance across diverse cellular contexts [5]. Understanding and addressing this low-variance challenge is crucial for advancing the field of predictive cellular modeling.
To ensure fair comparison across models, researchers have established comprehensive benchmarking protocols that test performance across multiple dimensions. The most rigorous evaluations employ several key Perturb-seq datasets covering different perturbation types and cell lines: Adamson (CRISPRi, single perturbations), Norman (CRISPRa, single and dual perturbations), and the genome-wide Replogle K562 and RPE1 CRISPRi screens [5].
The evaluation methodology follows a standardized workflow to ensure reproducible and comparable results across models [5]. Predictions are generated at the single-cell level, then aggregated to form pseudo-bulk expression profiles for each perturbation. These predicted profiles are compared against ground truth data using multiple correlation metrics, chiefly Pearson correlation in differential expression space (perturbed minus control), which isolates perturbation-specific effects.
Surprisingly, benchmarking results have demonstrated that even simple baseline models can outperform sophisticated foundation models on standard perturbation prediction tasks. The performance gap becomes particularly evident when evaluating in differential expression space, which more specifically captures perturbation effects [5].
Table 1: Performance Comparison Across Models (Pearson Δ Correlation)
| Model | Adamson | Norman | Replogle K562 | Replogle RPE1 |
|---|---|---|---|---|
| Train Mean | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| RF + GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| RF + scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |
The "Train Mean" baseline simply predicts the average expression profile from training examples for all perturbations, yet it consistently outperforms both scGPT and scFoundation across most datasets [5]. Even more strikingly, random forest models using biologically informed features (Gene Ontology vectors) substantially outperform all foundation models, suggesting that current scFMs may not be effectively leveraging their pretrained biological knowledge for perturbation prediction tasks.
The underlying issue with current benchmarking approaches stems from the low perturbation-specific variance in commonly used datasets. When the expression changes induced by perturbations are minimal compared to background biological variation and technical noise, models struggle to identify true signal, and benchmarking becomes unreliable [5].
Table 2: Characteristics of Perturbation Benchmark Datasets
| Dataset | Cell Count | Perturbation Type | Perturbation Variance | Primary Limitation |
|---|---|---|---|---|
| Adamson | 68,603 | CRISPRi (single) | Low | Minimal expression changes |
| Norman | 91,205 | CRISPRa (single/dual) | Low-Medium | Combinatorial complexity |
| Replogle K562 | ~162,750 | CRISPRi (genome-wide) | Low | High background noise |
| Replogle RPE1 | ~162,750 | CRISPRi (genome-wide) | Low | Cell-type specific effects |
The fundamental problem is that these datasets were primarily designed to detect differentially expressed genes rather than to train complex predictive models. Consequently, the effect sizes for most perturbations are quite small, with only subtle changes to the transcriptional landscape [5]. This creates a scenario where models that simply learn to predict average expression patterns can appear deceptively competent, as they minimize overall error without truly capturing perturbation-specific effects.
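A small numerical illustration of this effect, assuming lognormal baseline expression and a weak additive perturbation (toy values, not benchmark data):

```python
import numpy as np

# Why raw-expression correlation is misleading when perturbation effects
# are small: a "predict the control" model scores near-perfectly in raw
# space yet carries no perturbation-specific signal at all.
rng = np.random.default_rng(2)
baseline = rng.lognormal(2.0, 1.0, size=1000)    # highly variable gene means
true_effect = rng.normal(0.0, 0.05, size=1000)   # weak perturbation signal
observed = baseline + true_effect
no_change_pred = baseline                         # ignores the perturbation

raw_r = np.corrcoef(no_change_pred, observed)[0, 1]
delta_r = np.corrcoef(no_change_pred - baseline, observed - baseline)[0, 1]
assert raw_r > 0.99          # looks excellent in raw expression space
assert np.isnan(delta_r)     # zero-variance prediction: no delta signal
```

In raw space the correlation is driven almost entirely by baseline expression levels; only the delta-space view exposes that the prediction contains no information about the perturbation.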
The low variance problem creates several specific challenges for benchmarking.
Recent studies have demonstrated that the low variance issue is particularly problematic for transformer-based foundation models, which may require larger effect sizes to effectively leverage their attention mechanisms and capture meaningful gene-gene interactions [5]. Simpler models based on biological priors (like GO term embeddings) appear somewhat more robust to this challenge, potentially because they incorporate external knowledge that helps distinguish signal from noise.
To address the limitations of standard correlation-based metrics, researchers have developed more sophisticated evaluation approaches that specifically target perturbation effects, such as computing correlations in differential expression space and restricting evaluation to the top differentially expressed genes.
The zero-shot evaluation approach has been particularly revealing, demonstrating that both scGPT and Geneformer underperform simpler methods like highly variable gene selection or established integration tools like Harmony and scVI when applied without task-specific fine-tuning [16]. This suggests that current pretraining objectives may not effectively capture transferable biological principles.
Addressing the low variance challenge requires both improved datasets, with stronger perturbation-specific signal, and more sophisticated analytical approaches.
Recent model development has begun to address these challenges. For instance, CellFM—trained on 100 million human cells with 800 million parameters—shows improved performance in gene function prediction and cell annotation tasks, though its perturbation prediction capabilities still require comprehensive evaluation [12].
Table 3: Essential Computational Tools for Perturbation Modeling
| Resource | Type | Primary Function | Application in Benchmarking |
|---|---|---|---|
| scGPT | Foundation Model | Gene expression prediction | Baseline for transformer-based approaches [5] |
| scFoundation | Foundation Model | Masked autoencoding of expression | Comparison model for benchmarking [5] |
| CellFM | Large-scale Foundation Model | Multi-task single-cell analysis | Emerging model with 100M cell training [12] |
| Geneformer | Foundation Model | Rank-based gene modeling | Zero-shot performance evaluation [16] |
| Harmony | Integration Tool | Batch effect correction | Baseline for dataset integration [16] |
| scVI | Probabilistic Model | Dimensionality reduction | Reference for clustering performance [16] |
| CELLxGENE | Data Platform | Curated single-cell data | Source of standardized training data [1] |
| Perturb-seq | Technology | CRISPR screening + scRNA-seq | Primary data generation method [5] |
The benchmarking challenges posed by low-variance perturbation datasets represent a critical obstacle for advancing single-cell foundation models. Current evidence suggests that sophisticated transformer-based models like scGPT and scFoundation may not be effectively leveraging their architectural advantages for perturbation prediction, as simpler approaches consistently outperform them on standard benchmarks [5]. This performance gap appears to stem from both dataset limitations and potential shortcomings in model pretraining objectives.
Moving forward, the field requires several key advances: (1) development of higher-quality benchmarking datasets with stronger perturbation effects and richer biological contexts; (2) more sophisticated evaluation metrics that specifically assess perturbation-specific prediction rather than overall expression matching; and (3) improved model architectures and pretraining strategies that better capture causal biological relationships. The recent emergence of even larger models like CellFM trained on 100 million cells suggests that scaling alone may not address these fundamental challenges [12]. Instead, more targeted approaches combining biological prior knowledge with flexible deep learning architectures may be necessary to truly advance the state of the art in perturbation modeling.
As benchmarking methodologies continue to evolve, researchers should prioritize comprehensive evaluation across multiple biological contexts, careful examination of zero-shot capabilities, and rigorous comparison against simple but biologically informed baseline models [16]. Only through such rigorous approaches can we develop foundation models that genuinely advance our ability to predict and understand cellular responses to perturbation.
In the rapidly evolving field of single-cell genomics, foundation models like scGPT and scFoundation represent a significant leap forward, leveraging transformer architectures to interpret cellular "language" [1]. These models are pre-trained on millions of single-cell transcriptomes, learning fundamental principles of gene regulation and cell state that can be adapted to various downstream tasks through fine-tuning [1]. The core challenge, however, lies in effectively adapting these massive models to specific biological questions—such as predicting cellular responses to genetic perturbations—without requiring prohibitive computational resources or falling prey to overfitting on limited experimental data.
Recent comprehensive benchmarking studies have revealed surprising limitations in these foundation models. When evaluated for predicting post-perturbation gene expression profiles, even the simplest baseline models—such as predicting the mean expression from training examples—frequently outperformed sophisticated foundation models like scGPT and scFoundation [5] [14]. These findings highlight the critical importance of selecting appropriate optimization strategies when adapting pre-trained models, making the comparison between full fine-tuning, layer freezing, and Low-Rank Adaptation (LoRA) not merely technical but essential for advancing biological discovery.
Independent benchmarking studies have systematically evaluated scGPT and scFoundation against deliberately simple baselines for predicting transcriptome changes after genetic perturbations. The results have been sobering for proponents of large foundation models. Across multiple Perturb-seq datasets—including studies by Adamson, Norman, and Replogle—foundation models generally underperformed compared to a simple baseline that predicts the mean of training samples (Train Mean) [5]. Furthermore, standard machine learning models like Random Forest regressors, when equipped with biologically meaningful features such as Gene Ontology (GO) vectors, outperformed foundation models by a large margin [5].
A study published in Nature Methods (2025) reached similar conclusions, finding that no deep learning model could consistently outperform simple linear baselines or the mean prediction for forecasting the effects of unseen single or double perturbations [14]. This research also discovered that using the gene embeddings learned by scGPT and scFoundation within a simple linear model often achieved better performance than the fine-tuned foundation models themselves, suggesting that the pretrained representations contain valuable information that may be lost or poorly utilized during full fine-tuning [14].
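A sketch of this embedding-plus-linear-head strategy on synthetic data; `fit_embedding_linear_model` and the toy embeddings are illustrative stand-ins for vectors extracted from scGPT or scFoundation:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_embedding_linear_model(gene_embeddings, deltas, alpha=1.0):
    """Linear head on frozen, pretrained gene embeddings: map the embedding
    of each perturbed gene to its pseudo-bulk differential-expression
    profile, without updating the foundation model itself."""
    return Ridge(alpha=alpha).fit(gene_embeddings, deltas)

# Toy data: 60 perturbed genes with 32-dim embeddings, 100 measured genes,
# and a linear ground-truth mapping plus small noise.
rng = np.random.default_rng(3)
emb = rng.normal(size=(60, 32))
true_map = rng.normal(size=(32, 100))
deltas = emb @ true_map + rng.normal(0, 0.1, size=(60, 100))

model = fit_embedding_linear_model(emb[:50], deltas[:50])
held_out_r = np.corrcoef(model.predict(emb[50:]).ravel(),
                         deltas[50:].ravel())[0, 1]
assert held_out_r > 0.9  # linear head recovers the (linear) toy mapping
```

Keeping the embeddings frozen avoids the overfitting risk of full fine-tuning on small perturbation datasets, which is one plausible reason this strategy performed well in the cited study.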
Table 1: Benchmarking Results on Perturbation Prediction Tasks (Pearson Correlation in Differential Expression Space)
| Model / Dataset | Adamson | Norman | Replogle K562 | Replogle RPE1 |
|---|---|---|---|---|
| Train Mean | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| RF with GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| RF with scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |
Source: Adapted from BMC Genomics [5]
Table 2: Genetic Interaction Prediction Performance (Area Under Curve)
| Model | Performance (AUC) |
|---|---|
| No Change Baseline | 0.72 |
| Additive Model | 0.50 |
| GEARS | 0.67 |
| scGPT | 0.68 |
| scFoundation | 0.65 |
| Geneformer* | 0.58 |
| UCE* | 0.62 |
Source: Adapted from Nature Methods [14]. *Models not designed for this task, used with linear decoder.
The computational resources required for fine-tuning these foundation models are substantial, yet this investment does not correlate with performance on perturbation prediction tasks. scFoundation, trained on approximately 50 million human cells with ~0.1 billion parameters, and scGPT, trained on over 33 million human cells, both require significant GPU memory and training time for full fine-tuning [31]. One benchmarking study noted that despite these substantial computational investments, the foundation models were consistently outperformed by simpler, more efficient approaches [14].
Methodology: Full fine-tuning involves continuing the training process of all layers and parameters in a pre-trained model on a new, task-specific dataset. The entire weight matrix (W₀) is updated to W = W₀ + ΔW through backpropagation [32].
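As a concrete illustration of the W = W₀ + ΔW update, the PyTorch sketch below fine-tunes every parameter of a small stand-in network. The architecture is hypothetical (the real scGPT/scFoundation training loops are more involved); the key point is that nothing is frozen.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained backbone plus a task head;
# in full fine-tuning *every* parameter receives gradient updates.
model = nn.Sequential(
    nn.Linear(512, 512),   # "pretrained" layer (stand-in)
    nn.ReLU(),
    nn.Linear(512, 2000),  # task-specific output head
)

for p in model.parameters():
    p.requires_grad = True  # nothing frozen: W0 -> W0 + dW everywhere

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
x, y = torch.randn(8, 512), torch.randn(8, 2000)

loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()   # gradients flow to all layers
opt.step()        # all weights are updated

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

With every weight trainable, memory scales with the full parameter count, which is why multi-GPU setups are typically needed for models of scFoundation's size.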
Advantages and Disadvantages: This approach typically provides the highest baseline accuracy and task performance since the model can fully adjust all parameters to the new data [33] [34]. However, it demands enormous computational resources—often requiring multi-GPU setups (A100/H100) and substantial training time [35] [33]. Full fine-tuning also risks catastrophic forgetting, where the model over-specializes to the fine-tuned task and loses general knowledge acquired during pre-training [35].
Applications in Single-Cell Analysis: In the context of scGPT and scFoundation, full fine-tuning would theoretically allow the model to completely adapt its understanding of gene-gene relationships to specific perturbation contexts. However, given the limited size of most perturbation datasets (often with only hundreds of perturbations), this approach is prone to overfitting, potentially explaining the poor benchmarking performance observed in recent studies [5] [14].
Methodology: Layer freezing, a form of specification-based parameter-efficient fine-tuning, involves fine-tuning only a subset of the model's layers while keeping the majority frozen [36]. For example, researchers might freeze the earlier layers of scGPT that capture general gene relationships while unfreezing and fine-tuning only the final layers for task-specific adaptation.
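A minimal PyTorch sketch of this idea, using a stack of linear layers as a hypothetical stand-in for transformer blocks: the early blocks are frozen and only the final block remains trainable.

```python
import torch.nn as nn

# Hypothetical 4-block stand-in for a pretrained transformer: freeze the
# early blocks that encode general gene relationships, tune only the last.
blocks = nn.ModuleList([nn.Linear(256, 256) for _ in range(4)])

for block in blocks[:3]:            # freeze the first three blocks
    for p in block.parameters():
        p.requires_grad = False

# Only the final block's weight and bias remain trainable.
n_trainable = sum(p.numel() for p in blocks.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in blocks.parameters())
```

Only the trainable subset is passed to the optimizer, so gradient state and optimizer memory shrink accordingly.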
Advantages and Disadvantages: This approach significantly reduces computational requirements and mitigates catastrophic forgetting by preserving the foundational knowledge in frozen layers [36]. The tradeoff is potentially lower task performance compared to full fine-tuning, as the model has limited adaptability. A critical consideration is determining which layers to freeze—a decision that often requires domain expertise and experimentation [36].
Evidence from Single-Cell Research: Studies evaluating parameter-efficient methods for pre-trained models in annotating scRNA-seq data have found that freezing layers tuning (FL) can achieve performance comparable to vanilla fine-tuning while dramatically reducing tunable parameters [36]. When applied to scBERT (a transformer model for single-cell data), layer freezing maintained strong performance on cell type annotation tasks while offering significant efficiency gains [36].
Methodology: LoRA is a reparameterization-based PEFT method that freezes the pre-trained model weights and injects trainable rank decomposition matrices into transformer layers [32] [36]. Instead of updating the entire weight matrix ΔW, LoRA approximates it with the product of two smaller matrices ΔW = BA, where B and A have much lower dimensions [32]. For a layer with d×d parameters, LoRA reduces trainable parameters to 2×d×r, where r is the rank (typically 4-64) [32].
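The ΔW = BA reparameterization can be sketched as a thin wrapper around a frozen linear layer. This is an illustrative implementation, not the Hugging Face PEFT library or the CellFM code; the rank r and scaling factor alpha are example values.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable rank-r update dW = B @ A,
    following the LoRA reparameterization described above."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False             # W0 stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))        # d_out x r
        self.scale = alpha / r

    def forward(self, x):
        # W0 x + (alpha/r) * B A x; B starts at zero, so the wrapped
        # layer initially behaves exactly like the pretrained one.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=8)
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```

For d = 512 and r = 8 this trains 2 x d x r = 8,192 parameters instead of d x d = 262,144, and after training B @ A can be folded into the base weights so inference cost is unchanged.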
Advantages and Disadvantages: LoRA dramatically reduces trainable parameters and GPU memory requirements while leaving the pre-trained weights untouched, and the low-rank update can be merged into the base weights after training, adding no inference overhead [32]. The primary challenge lies in balancing rank and performance—lower ranks save resources but may not capture task complexity [32].
Applications in Single-Cell Foundation Models: The recently introduced CellFM model incorporates LoRA modules to reduce trainable parameters during fine-tuning for new datasets [31]. This approach demonstrates how LoRA can enable efficient adaptation of large single-cell foundation models (800M parameters in CellFM) without compromising performance across diverse applications like cell annotation and perturbation prediction [31].
Table 3: Comparison of Fine-Tuning Strategies for Single-Cell Foundation Models
| Feature | Full Fine-Tuning | Layer Freezing | LoRA |
|---|---|---|---|
| Trainable Parameters | 100% | 1-20% | 1-5% |
| GPU Memory Requirements | Very High | Moderate | Low |
| Task Performance | Highest (theoretical) | Moderate | Near-full |
| Risk of Overfitting | High | Moderate | Low |
| Training Speed | Slow | Moderate | Fast |
| Inference Overhead | None | None | None (when merged) |
| Multiple Task Support | Poor (separate model per task) | Moderate | Excellent (adapter swapping) |
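The "Trainable Parameters" row above can be made concrete with back-of-envelope arithmetic for a single 512×512 weight matrix, a typical projection size in these models:

```python
# Parameter counts for one 512x512 weight matrix under each strategy.
d, r = 512, 8

full = d * d        # full fine-tuning updates every weight
lora = 2 * d * r    # LoRA trains only B (d x r) and A (r x d)
frac = lora / full  # LoRA's trainable fraction for this layer

print(full, lora, frac)  # the fraction here is about 3%
```

Summed over all projection matrices in a transformer, this is what drives the order-of-magnitude memory savings in the table.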
Dataset Preparation: The standard benchmarking protocol utilizes multiple Perturb-seq datasets (Adamson, Norman, and Replogle K562/RPE1) to evaluate generalization capabilities [5] [14].
Evaluation Methodology: Predictions are scored against held-out pseudo-bulk profiles using Pearson correlation in the differential expression space (Pearson Delta) [5].
Baseline Models: Include simple baselines like Train Mean (average of training pseudo-bulk profiles) and Random Forest regressors with biological features (GO vectors, gene embeddings) [5].
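The Train Mean baseline is worth spelling out, given how hard it is to beat: it averages the training pseudo-bulk profiles once and reuses that single vector as the prediction for every held-out perturbation. A minimal sketch with synthetic stand-in data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pseudo-bulk matrix: 50 training perturbations x 1000 genes.
train_profiles = rng.normal(size=(50, 1000))

# Train Mean: one averaged profile, reused as the prediction for every
# unseen perturbation.
train_mean = train_profiles.mean(axis=0)

def predict_train_mean(n_test: int) -> np.ndarray:
    """Repeat the training mean once per held-out perturbation."""
    return np.tile(train_mean, (n_test, 1))

preds = predict_train_mean(5)
```

That such a perturbation-agnostic predictor outperforms fine-tuned foundation models is the central negative result of these benchmarks.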
Full Fine-Tuning Protocol:
Layer Freezing Protocol:
LoRA Implementation Protocol:
Diagram 1: Architectural comparison of the three optimization strategies, showing parameter update patterns and data flow during fine-tuning.
Diagram 2: Benchmarking workflow for evaluating optimization strategies on single-cell foundation models using perturbation prediction tasks.
Table 4: Key Computational Tools and Resources for Single-Cell Foundation Model Research
| Tool/Resource | Type | Function | Relevance to Optimization Strategies |
|---|---|---|---|
| Hugging Face PEFT | Software Library | Parameter-Efficient Fine-Tuning | Implements LoRA, Adapter methods for transformer models |
| scGPT Framework | Model Architecture | Single-Cell Foundation Model | Target for optimization strategies; provides pre-trained weights |
| scFoundation Model | Model Architecture | Single-Cell Foundation Model | Comparison model for benchmarking studies |
| Perturb-seq Datasets | Experimental Data | Benchmark Validation | Adamson, Norman, Replogle datasets for evaluating perturbation prediction |
| Gene Ontology (GO) Vectors | Biological Prior Knowledge | Feature Representation | Biological features for baseline models; enhances interpretability |
| MindSpore/PyTorch | AI Framework | Model Training & Inference | Computational backbone for implementing optimization strategies |
| CellFM | Integrated Framework | Large-scale scFM with LoRA | Example of LoRA integration in production-scale model [31] |
The benchmarking evidence clearly indicates that despite their theoretical promise, single-cell foundation models like scGPT and scFoundation do not currently outperform simple baselines for perturbation prediction tasks [5] [14]. This surprising finding underscores the importance of rigorous evaluation and suggests that model size and pre-training scale alone are insufficient for mastering cellular response prediction.
Based on the comprehensive analysis of optimization strategies, we recommend:
Start with Simple Baselines: Before investing in foundation model fine-tuning, establish performance baselines using Random Forest models with biological features like GO terms or pre-computed gene embeddings [5].
Prioritize LoRA for Foundation Model Adaptation: When fine-tuning scGPT or similar models, LoRA provides the best balance of efficiency and performance, achieving near-full fine-tuning results with dramatically reduced resources [32] [31].
Use Layer Freezing for Transfer Learning: When adapting foundation models to conceptually similar tasks (e.g., different cell types), layer freezing offers a practical middle ground with reduced overfitting risk [36].
Reserve Full Fine-Tuning for Data-Rich Scenarios: Only consider full fine-tuning when you have large, high-quality perturbation datasets and ample computational resources—and even then, temper performance expectations based on current benchmarking results [5] [14].
The field of single-cell foundation models remains young, and current limitations in perturbation prediction likely reflect both methodological challenges and the inherent complexity of biological systems. As model architectures, training strategies, and optimization techniques continue to mature, the careful application of these adaptation strategies will be crucial for translating computational advances into biological insights.
The emergence of single-cell foundation models (scFMs), such as scGPT and scFoundation, has heralded a new era in computational biology, promising to decode the complex language of cellular processes from vast single-cell RNA sequencing (scRNA-seq) datasets. These models, often built on transformer architectures, are pre-trained on millions of cells to learn fundamental representations of gene regulation and cell states, which can then be fine-tuned for specific downstream tasks like perturbation prediction and cell type annotation [1]. However, recent rigorous benchmarking studies have revealed a critical insight: while these models learn powerful embeddings, their standalone performance in specific tasks, such as predicting gene perturbation effects, often fails to surpass deliberately simple baselines [5] [14]. This surprising finding has directed attention toward a more promising approach—strategically combining the latent representations learned by foundation models with structured biological prior knowledge. This guide provides a comprehensive comparison of this hybrid methodology, evaluating its performance against standalone models and detailing the experimental protocols that enable researchers to effectively leverage these integrated approaches for enhanced biological discovery.
Recent comprehensive benchmarks have consistently demonstrated that scFMs, including scGPT and scFoundation, frequently underperform simple baseline models in critical prediction tasks. One landmark study found that even the simplest baseline—predicting the mean expression profile from training data—outperformed both scGPT and scFoundation in predicting post-perturbation gene expression profiles across four different Perturb-seq datasets [5]. Similarly, a benchmark published in Nature Methods concluded that "deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines," with none of the five evaluated foundation models surpassing a straightforward additive model [14].
The performance gap is particularly evident in differential expression prediction. In the Adamson dataset, the Train Mean baseline achieved a Pearson Delta correlation of 0.711, outperforming scGPT (0.641) and scFoundation (0.552). This pattern persisted across datasets, with Random Forest regression using Gene Ontology (GO) features substantially outperforming both foundation models (0.739 vs. 0.641 and 0.552, respectively, on the Adamson dataset) [5]. These results indicate that the current pretraining paradigms for scFMs may not be effectively capturing the causal relationships necessary for accurate perturbation response prediction.
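The Pearson Delta metric used in these comparisons correlates predicted and observed expression *changes* (perturbed minus control) rather than raw profiles, which prevents a model from scoring well merely by reproducing the control state. A minimal sketch, with short toy vectors standing in for pseudo-bulk profiles:

```python
import numpy as np

def pearson_delta(pred_post, true_post, control):
    """Pearson correlation between predicted and observed expression
    shifts (post-perturbation minus control)."""
    pred_delta = np.asarray(pred_post) - np.asarray(control)
    true_delta = np.asarray(true_post) - np.asarray(control)
    return np.corrcoef(pred_delta, true_delta)[0, 1]

control = np.zeros(4)
true_post = np.array([1.0, 2.0, 3.0, 4.0])

perfect = pearson_delta(true_post, true_post, control)      # exact match
scaled = pearson_delta(2 * true_post, true_post, control)   # right direction
```

Note that correlation is invariant to positive rescaling, so the metric rewards getting the *direction* of each gene's change right rather than its absolute magnitude.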
In response to these limitations, researchers have developed hybrid approaches that combine foundation model embeddings with biological prior knowledge. This integration has demonstrated remarkable success, often bridging the performance gap between standalone foundation models and simple baselines. When scGPT's embeddings were used as features in a Random Forest model instead of being used in the fine-tuned scGPT model itself, performance improved significantly (Pearson Delta: 0.727 vs. 0.641 on the Adamson dataset), though it still trailed Random Forest with GO features (0.739) [5].
Another compelling approach utilizes natural language processing-based gene embeddings from scELMO, which incorporates textual descriptions of genes generated by large language models. Random Forest models using scELMO features achieved competitive performance (0.706 on Adamson) comparable to GO-based models [5]. This suggests that textual biological knowledge can serve as an effective prior when combined with statistical learning methods.
Table 1: Performance Comparison of Prediction Models Across Perturbation Datasets (Pearson Delta Metric)
| Model Category | Specific Model | Adamson | Norman | Replogle K562 | Replogle RPE1 |
|---|---|---|---|---|---|
| Simple Baselines | Train Mean | 0.711 | 0.557 | 0.373 | 0.628 |
| Foundation Models | scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| | scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Biological Prior Models | RF + GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| Hybrid Models | RF + scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |
| | RF + scELMO Embeddings | 0.706 | 0.663 | 0.471 | 0.651 |
The relative performance of different modeling approaches varies significantly across biological tasks. While foundation models may struggle with perturbation prediction, they excel in other domains. For drug response prediction, scFoundation achieved the highest mean F1 scores of 0.971 and 0.947 using layer-freezing and fine-tuning strategies, respectively, outperforming the lowest-performing model by over 50% in pooled-data evaluation [22] [11]. In cross-data evaluation for drug response, UCE performed best after fine-tuning (mean F1: 0.774), while scGPT demonstrated superior performance in zero-shot learning (mean F1: 0.858) [22] [11].
This task-dependent performance emphasizes that no single model consistently outperforms others across all scenarios. A comprehensive benchmark of six scFMs confirmed this finding, revealing that model performance must be evaluated in the context of specific applications, with different models excelling in tasks such as cell type annotation, batch integration, and drug sensitivity prediction [4] [3].
Table 2: Model Performance Across Different Biological Tasks
| Task Category | Best Performing Model | Key Metric | Performance | Key Insight |
|---|---|---|---|---|
| Perturbation Prediction | RF + GO Features | Pearson Delta | 0.739 (Adamson) | Biological priors outperform foundation models |
| Drug Response (Pooled-data) | scFoundation | F1 Score | 0.971 | Foundation models excel with sufficient data |
| Drug Response (Cross-data) | UCE (fine-tuned) | F1 Score | 0.774 | Fine-tuning enhances cross-dataset generalization |
| Drug Response (Zero-shot) | scGPT | F1 Score | 0.858 | Strong zero-shot transfer learning capability |
| Cell Type Annotation | scGraphformer | Accuracy | Superior across 20 datasets | Hybrid architecture captures cell-cell relationships |
The first critical step in creating hybrid models is extracting meaningful embeddings from pre-trained foundation models. For scGPT, which uses a transformer architecture with a GPT-based decoder, gene embeddings can be extracted from the input embedding layer [1] [14]. These embeddings typically have a dimensionality of 512 and are designed to capture contextual relationships between genes based on the model's pre-training on millions of single-cell transcriptomes [5].
For scFoundation, which employs an asymmetric encoder-decoder architecture, gene embeddings of dimension 768 can be extracted [4]. The pre-training process for scFoundation uses a read-depth-aware masked gene modeling objective with mean squared error loss, which encourages the model to learn biologically meaningful representations of genes that capture their functional relationships [4] [14].
The extraction protocol typically involves loading the pre-trained checkpoint, mapping each gene symbol to its index in the model's vocabulary, and reading out the corresponding rows of the input embedding matrix, which can then be cached as fixed feature vectors for downstream models.
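A minimal sketch of such a lookup, with a hypothetical three-gene vocabulary and a random matrix standing in for a checkpoint's 512-dimensional input embeddings (gene names and sizes here are illustrative, not the actual scGPT vocabulary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical vocabulary and embedding matrix as they might be read
# from a pretrained checkpoint (512-dim, as described for scGPT).
vocab = {"TP53": 0, "MYC": 1, "KRAS": 2}
embedding_matrix = rng.normal(size=(len(vocab), 512))

def gene_embedding(symbol: str) -> np.ndarray:
    """Look up a gene's frozen embedding row by vocabulary index."""
    return embedding_matrix[vocab[symbol]]

# Stack the frozen embeddings of the perturbed genes into a feature matrix.
features = np.stack([gene_embedding(g) for g in ["TP53", "KRAS"]])
```

Because the weights are frozen, this lookup runs once per gene and the resulting vectors can be cached to disk for reuse across downstream models.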
The true power of hybrid approaches emerges when foundation model embeddings are integrated with structured biological knowledge. Gene Ontology (GO) provides a comprehensive computational model of biological systems, capturing functional relationships between genes across three domains: biological process, molecular function, and cellular component [5].
The integration protocol typically involves encoding each perturbed gene as a binary membership vector over GO terms, optionally concatenating these vectors with pre-trained gene embeddings, and feeding the combined features to a conventional regressor such as a Random Forest [5].
A successful implementation of this approach used Random Forest regression with GO features, which substantially outperformed standalone foundation models across multiple perturbation datasets [5]. The model took as input GO vectors representing the perturbed genes and achieved a Pearson Delta correlation of 0.739 on the Adamson dataset, compared to 0.641 for scGPT.
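A minimal sketch of the GO-feature approach, with a hypothetical three-term annotation table and toy expression shifts (real GO vectors span thousands of terms and the targets are genome-wide pseudo-bulk deltas):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical GO annotations: each perturbed gene becomes a binary
# membership vector over a fixed, ordered set of GO terms.
go_terms = ["GO:0006915", "GO:0008283", "GO:0006974"]
annotations = {
    "TP53": {"GO:0006915", "GO:0006974"},
    "MYC":  {"GO:0008283"},
    "KRAS": {"GO:0008283", "GO:0006974"},
}

def go_vector(gene: str) -> np.ndarray:
    return np.array([t in annotations[gene] for t in go_terms], dtype=float)

X = np.stack([go_vector(g) for g in ["TP53", "MYC", "KRAS"]])
# Toy expression shifts (3 perturbations x 2 genes) standing in for
# genome-wide pseudo-bulk deltas.
y = np.array([[0.5, -1.0], [1.2, 0.3], [0.1, 0.8]])

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = rf.predict(go_vector("TP53").reshape(1, -1))
```

The model never sees expression data for the perturbed gene itself; its prediction is driven entirely by functional similarity to training perturbations, which is what makes the GO prior so effective.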
Robust evaluation is essential for comparing hybrid approaches against standalone models. The benchmarking pipeline should include:
Data Splitting Strategies: Hold out perturbations (or entire datasets) from training so that evaluation measures generalization to unseen perturbations rather than memorization [5].
Performance Metrics: Report Pearson correlation in the differential expression space (Pearson Delta) computed on pseudo-bulk profiles [5].
Baseline Models: Always include the Train Mean baseline and Random Forest models with biological features, which set a surprisingly high bar [5].
Table 3: Key Research Reagents and Computational Tools for Hybrid Representation Learning
| Category | Resource Name | Specifications/Features | Primary Application |
|---|---|---|---|
| Foundation Models | scGPT | 50M parameters, 512D embeddings, GPT-based decoder | Gene embedding extraction, perturbation prediction |
| | scFoundation | 100M parameters, 768D embeddings, encoder-decoder | Large-scale representation learning |
| Biological Knowledge Bases | Gene Ontology (GO) | Functional term relationships across three domains | Biological prior feature engineering |
| | KEGG Pathways | Curated pathway maps and functional hierarchies | Pathway-aware model integration |
| | REACTOME | Detailed curated biological pathway database | Biological validation and interpretation |
| Benchmark Datasets | Perturb-seq Datasets | Adamson, Norman, Replogle (K562, RPE1) | Perturbation prediction benchmarking |
| | Drug Response Collections | scDrugMap (326,751 cells, 36 datasets) | Drug sensitivity prediction evaluation |
| Computational Frameworks | scDrugMap | Python CLI and web server for drug response | End-to-end model evaluation platform |
| | scGraphformer | Transformer-GNN hybrid architecture | Cell type annotation and relationship learning |
| Evaluation Metrics | scGraph-OntoRWR | Cell ontology-informed consistency metric | Biological relevance assessment |
| | LCAD (Lowest Common Ancestor Distance) | Ontological proximity for misclassification error | Biological meaningfulness of errors |
The integration of foundation model embeddings with biological prior knowledge represents a powerful paradigm for enhancing representation learning in computational biology. While standalone foundation models like scGPT and scFoundation have demonstrated impressive capabilities in certain domains, particularly drug response prediction with sufficient data, their performance in critical tasks like perturbation prediction often trails simpler approaches that explicitly incorporate biological knowledge. The hybrid methodologies detailed in this guide—which combine the latent representations learned by foundation models with structured biological knowledge from sources like Gene Ontology—consistently achieve superior performance across multiple benchmarking scenarios.
This comparative analysis reveals that the future of biological representation learning lies not in increasingly larger foundation models alone, but in the thoughtful integration of these models with the rich structured knowledge accumulated through decades of biological research. As the field advances, the most impactful approaches will likely be those that can most effectively bridge the gap between data-driven representation learning and established biological principles, creating models that are both statistically powerful and biologically meaningful.
The emergence of single-cell foundation models (scFMs) like scGPT and scFoundation promises to revolutionize biological discovery by providing a unified framework for analyzing cellular transcriptomes. A core claim of these models is that their embeddings—internal representations of cellular states—can effectively separate biological signals from non-biological noise, a capability paramount for robust single-cell analysis [13]. Technical variability, or "batch effects," introduced by different sequencing protocols, laboratories, or experimental conditions, poses a significant obstacle to this goal. If not properly corrected, these artifacts can obscure true biological differences, leading to misleading conclusions in downstream analyses [37]. Therefore, the ability of a model to generate embeddings that are invariant to technical confounders while preserving biological heterogeneity is a critical benchmark for its utility in real-world research and drug development. This guide objectively compares the performance of scGPT and scFoundation in mitigating batch effects, synthesizing the latest experimental data to inform their practical application.
Recent independent benchmarking studies have rigorously evaluated scGPT and scFoundation against simpler models and each other. The results reveal distinct performance profiles, particularly in perturbation prediction and batch integration tasks. The table below summarizes key quantitative findings.
Table 1: Performance Comparison of scGPT and scFoundation on Key Benchmarks
| Model | Primary Architecture | Performance on Perturbation Prediction (Pearson Delta, Mean across Datasets) | Performance on Batch Integration (iLISI score, example datasets) | Zero-shot Cell Type Clustering (AvgBIO score) | Key Strengths | Key Limitations |
|---|---|---|---|---|---|---|
| scGPT | Transformer-based (Value Categorization) | 0.530 (Adamson, Norman, Replogle K562 & RPE1) [5] | Outperforms scVI on complex batch effects (Tabula Sapiens) [16] | Inconsistent; outperformed by HVG and scVI on most datasets [16] | Robust performance across multiple tasks; can handle complex technical and biological batch effects [16] [17] | Underperforms simple baselines in perturbation prediction; inconsistent zero-shot clustering [5] [16] |
| scFoundation | Transformer-based (Masked Autoencoder) | 0.438 (Adamson, Norman, Replogle K562 & RPE1) [5] | Specific batch integration performance not detailed in results | Specific zero-shot clustering performance not detailed in results | Provides biologically meaningful gene embeddings [5] [17] | Underperforms simple baselines in perturbation prediction; requires specific gene sets, limiting applicability [5] [14] |
| Simple Baseline (Train Mean) | N/A | 0.567 (Adamson, Norman, Replogle K562 & RPE1) [5] | N/A | N/A | Surprisingly strong baseline for perturbation prediction [5] [14] | Incapable of capturing complex biological interactions [14] |
| Random Forest + GO Features | Ensemble Learning | 0.613 (Adamson, Norman, Replogle K562 & RPE1) [5] | N/A | N/A | Outperforms foundation models by a large margin in perturbation prediction [5] | Relies on prior biological knowledge (GO terms) |
A critical insight from these benchmarks is that even simple models can rival or exceed the performance of large foundation models in specific tasks like perturbation prediction. For instance, a baseline that simply predicts the mean expression from the training data outperformed both scGPT and scFoundation across several datasets [5] [14]. Furthermore, a Random Forest model using Gene Ontology (GO) features "outperformed foundation models by a large margin" [5]. This suggests that the current general-purpose representations learned by scFMs may not yet be superior to task-specific models that incorporate curated biological knowledge.
Table 2: Analysis of Model Embeddings for Downstream Tasks
| Embedding Type | Source Model | Utility in Downstream Prediction | Biological Meaningfulness |
|---|---|---|---|
| Gene Embeddings | scGPT | Effective when used in a simple linear model, outperforming scGPT's own fine-tuned decoder [14] | Captures some gene-gene relationships [5] |
| Gene Embeddings | scFoundation | Effective when used in a simple linear model [14] | Provides biologically meaningful features [17] |
| Gene Embeddings | scELMO | Similar performance to GO-based Random Forest models [5] | Derived from LLM-generated gene descriptions |
| Perturbation Embeddings | GEARS | Enables linear models to perform on par with the original model [14] | Encodes perturbation relationships |
To ensure reproducibility and critical evaluation, understanding the standard protocols for benchmarking batch effect correction and embedding quality is essential. The following workflow outlines a typical evaluation pipeline.
Benchmarks typically use publicly available scRNA-seq datasets with known, pronounced batch effects, such as the Pancreas dataset, which combines data from five different sources [16]. A critical step is pseudo-bulk creation, where gene expression profiles for each perturbation or condition are averaged to form a more stable profile for comparison [5]. Evaluation is often conducted under two main setups: a zero-shot setting, in which pre-trained embeddings are used directly without task-specific training, and a fine-tuned setting, in which the model is further trained on the evaluation task [16].
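The pseudo-bulk step just described amounts to a group-by average over cells sharing a perturbation label. A minimal pandas sketch with a toy six-cell matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical single-cell matrix: 6 cells x 3 genes, each cell labeled
# with the perturbation it received.
expr = pd.DataFrame(
    np.arange(18, dtype=float).reshape(6, 3),
    columns=["GENE_A", "GENE_B", "GENE_C"],
)
expr["perturbation"] = ["ctrl", "ctrl", "KO1", "KO1", "KO2", "KO2"]

# Pseudo-bulk creation: average the expression of all cells sharing a
# perturbation label into one stable profile per condition.
pseudo_bulk = expr.groupby("perturbation").mean()
```

Averaging suppresses the per-cell dropout noise characteristic of scRNA-seq, giving each condition a single stable profile to compare against predictions.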
The performance of embedding models is quantified using metrics that balance batch correction with biological fidelity.
Table 3: Key Metrics for Evaluating Embedding Quality
| Metric | What It Measures | Interpretation |
|---|---|---|
| iLISI (Graph Integration Local Inverse Simpson's Index) | Batch mixing in local cell neighborhoods [37] | Higher scores indicate better batch integration. |
| NMI (Normalized Mutual Information) | Preservation of cell type identity after integration [37] | Higher scores indicate better preservation of biological signal. |
| Pearson Delta | Correlation between predicted and actual differential expression profiles [5] | Higher scores indicate more accurate perturbation prediction. |
| ASW (Average Silhouette Width) & AvgBIO | Cell type separation and clustering accuracy [16] | Higher scores indicate better separation of distinct cell types. |
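As a small worked example of the NMI metric from the table, scikit-learn's `normalized_mutual_info_score` compares cluster assignments from an embedding against known cell type labels (integer-coded here):

```python
from sklearn.metrics import normalized_mutual_info_score

# Known cell type labels: 0 = T, 1 = B, 2 = NK (toy example).
cell_types = [0, 0, 1, 1, 2, 2]

perfect_clusters = [0, 0, 1, 1, 2, 2]   # clustering recovers the labels
merged_clusters = [0, 0, 0, 0, 1, 1]    # T and B cells wrongly merged

nmi_perfect = normalized_mutual_info_score(cell_types, perfect_clusters)
nmi_merged = normalized_mutual_info_score(cell_types, merged_clusters)
```

NMI is invariant to the arbitrary numbering of clusters, so it scores the correspondence between partitions rather than the label values themselves; over-merging biological populations lowers the score.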
The fundamental challenge in batch correction is to remove technical artifacts without erasing meaningful biological variation. The following diagram illustrates this problem and the desired outcome.
Successful benchmarking and application of scFMs require a suite of computational tools and data resources.
Table 4: Key Resources for scFM Research
| Resource Name | Type | Primary Function | Relevance to Benchmarking |
|---|---|---|---|
| BioLLM Framework | Software Framework | Provides unified APIs for applying and evaluating different scFMs [17] | Standardizes model access and evaluation, enabling fair comparisons. |
| Perturb-seq Datasets | Data | Provides ground truth data on gene expression before/after genetic perturbation (e.g., Adamson, Norman, Replogle data) [5] [14] | Essential for evaluating model performance on perturbation prediction tasks. |
| CZ CELLxGENE | Data Platform | A curated corpus of standardized single-cell datasets [13] | Source of diverse, high-quality data for model pretraining and testing. |
| Harmony & scVI | Software Tools | Established methods for data integration and embedding generation [16] | Critical baseline models for benchmarking the performance of newer scFMs. |
| Gene Ontology (GO) | Knowledge Base | A structured repository of gene function annotations [5] | Used to create biologically meaningful features for baseline models. |
The emergence of large-scale foundation models trained on massive single-cell transcriptomics datasets has revolutionized computational biology, offering the potential to capture complex gene-gene relationships and cellular states. Among these, scGPT and scFoundation represent two leading approaches in the landscape of single-cell artificial intelligence. Within the specific context of drug response prediction—a critical task in oncology and therapeutic development—rigorous benchmarking is essential to guide model selection and application. This comparison guide focuses on evaluating these models under pooled-data evaluation scenarios, where models are trained and tested on aggregated data from multiple studies. This approach tests a model's ability to integrate diverse data sources and extract generalizable patterns, a capability with immense value for real-world drug discovery applications. Recent comprehensive studies, particularly the scDrugMap benchmark, have provided the community with robust, data-driven insights into the comparative performance of these models, consistently highlighting scFoundation's superior predictive capabilities in this specific evaluation setting [21] [11].
The scDrugMap benchmark, an extensive framework for evaluating foundation models in drug response prediction, provides clear quantitative results from pooled-data evaluation. The table below summarizes the key performance metrics, measured by the F1 score, for the leading models.
Table 1: Model Performance in Pooled-Data Evaluation on Primary Data Collection (scDrugMap Benchmark)
| Foundation Model | Training Strategy | Mean F1 Score | Performance Notes |
|---|---|---|---|
| scFoundation | Layer Freezing | 0.971 | Highest performing model; outperformed lowest by 54% [21]. |
| scFoundation | Fine-tuning (LoRA) | 0.947 | Highest performing fine-tuned model [21] [11]. |
| UCE | Fine-tuning (LoRA) | 0.774 | Top performer in cross-data evaluation after fine-tuning [21]. |
| scGPT | Zero-shot Learning | 0.858 | Demonstrated superior performance in zero-shot setting [21]. |
| scBERT | Layer Freezing | 0.630 | Lowest performing model in this benchmark [21]. |
The results demonstrate that scFoundation achieved the highest mean F1 scores of 0.971 (with layer freezing) and 0.947 (with fine-tuning using Low-Rank Adaptation) in the pooled-data evaluation on the primary collection of 326,751 single cells [21] [11]. This indicates that when data from multiple sources are aggregated, scFoundation's pretraining and architecture provide a significant advantage in accurately distinguishing between drug-sensitive and drug-resistant cells.
Understanding the experimental design behind these conclusions is crucial for interpreting the results.
The scDrugMap benchmark provides a standardized environment for a fair comparison. Its key components include the curated primary collection of 326,751 single cells from 36 datasets, standardized data loaders and evaluation scripts, and three training regimes: zero-shot learning, layer freezing, and LoRA-based fine-tuning [21].
The performance differences can be traced back to architectural and pretraining choices.
The direct prediction of raw expression values by scFoundation may contribute to its advantage in capturing subtle, biologically relevant signals necessary for predicting complex phenotypes like drug response.
The following diagram illustrates the logical workflow of the scDrugMap benchmarking process that leads to scFoundation's top-tier performance in pooled-data evaluation.
Diagram 1: scDrugMap Benchmarking Workflow. This workflow outlines the key stages in the scDrugMap pooled-data evaluation, from data input and model processing to the final performance assessment that identified scFoundation's superior performance.
For researchers aiming to reproduce or build upon these benchmarks, the following table details essential computational resources and their functions.
Table 2: Essential Research Reagents and Computational Resources
| Resource / Solution | Function in Evaluation | Source / Reference |
|---|---|---|
| scDrugMap Framework | Provides the integrated benchmarking environment, data loaders, and evaluation scripts. | https://scdrugmap.com/ [21] |
| Primary Data Collection | The curated set of 326,751 single cells from 36 datasets; serves as the primary benchmark. | Manually curated from 23 published studies [21] |
| Low-Rank Adaptation (LoRA) | A parameter-efficient fine-tuning strategy used to adapt foundation models to the drug response task. | Hu et al. (2021) [21] |
| Pre-trained Model Weights (scFoundation) | The foundational parameters of scFoundation, enabling transfer learning. | https://aigp.biomap.com/ [39] |
| Pre-trained Model Weights (scGPT) | The foundational parameters of scGPT for comparative analysis. | Cui et al. (2024) [12] |
The consistent findings from the scDrugMap benchmark firmly establish scFoundation as the leading model for drug response prediction in pooled-data evaluation scenarios. Its superior performance, evidenced by an F1 score exceeding 0.97, underscores the effectiveness of its value-projection-based architecture and large-scale pretraining when learning from aggregated, multi-study datasets. This capability is directly applicable to real-world drug discovery efforts that seek to integrate diverse experimental data to build robust predictive models.
However, model selection is context-dependent. While scFoundation excels in pooled-data evaluation, scGPT has demonstrated superior performance in zero-shot learning settings [21], and other models like UCE perform well in cross-data evaluations [21]. Therefore, the choice between scFoundation and alternatives should be guided by the specific experimental design and application requirements. Researchers are encouraged to leverage the scDrugMap platform and the resources outlined in this guide to conduct their own validations, further solidifying the evidence-based application of single-cell foundation models in accelerating therapeutic development.
The emergence of single-cell foundation models (scFMs) represents a transformative development in computational biology, promising to unlock deeper insights into cellular behavior and accelerate therapeutic discovery. These models, pre-trained on millions of single-cell transcriptomes, aim to capture universal biological principles that can be adapted to diverse downstream tasks. Among the most prominent scFMs are scGPT and scFoundation, both transformer-based architectures trained at unprecedented scale. However, rigorous independent benchmarking has revealed critical insights about their respective capabilities and limitations, particularly regarding cross-data generalization and zero-shot performance—the ability to perform tasks without task-specific training. This comparative analysis synthesizes evidence from multiple recent studies to evaluate scGPT's performance against scFoundation and other alternatives, focusing specifically on generalization capabilities that are essential for real-world biomedical applications where labeled data is scarce or novel cell types and perturbations are encountered.
Predicting cellular responses to genetic perturbations constitutes a fundamental test for scFMs' understanding of gene regulatory networks. Recent benchmarks have evaluated scGPT and scFoundation against simpler baseline models on standardized Perturb-seq datasets, with revealing results.
Table 1: Performance Comparison on Perturbation Response Prediction (Pearson Correlation in Differential Expression Space)
| Model | Adamson Dataset | Norman Dataset | Replogle K562 | Replogle RPE1 |
|---|---|---|---|---|
| Train Mean (Simplest Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest with GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| Random Forest with scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |
Surprisingly, even the simplest baseline—predicting the mean of training examples—outperformed both foundation models across all datasets [5]. Similarly, a Nature Methods study found that "deep-learning-based foundation models did not perform better than deliberately simplistic linear prediction models" for predicting gene perturbation effects [14]. However, when scGPT's pretrained gene embeddings were used in simpler random forest models, performance improved substantially, suggesting these embeddings do contain biologically meaningful information that the full fine-tuned model fails to leverage optimally [5].
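The embedding-reuse idea from [5] is straightforward to sketch: extract a fixed embedding vector per perturbed gene and let a random forest map it to the mean differential expression profile. The sketch below uses random arrays as stand-ins for scGPT's gene-token embeddings and for observed expression changes; names such as `gene_embeddings` and `mean_delta_expression` are illustrative, not from the benchmark's code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical setup: each perturbed gene has a pretrained embedding
# (e.g. extracted from scGPT's gene-token embedding table); the target is
# the mean differential expression profile observed for that perturbation.
n_perts, emb_dim, n_genes = 120, 32, 50
gene_embeddings = rng.normal(size=(n_perts, emb_dim))       # X: one row per perturbed gene
mean_delta_expression = rng.normal(size=(n_perts, n_genes)) # y: mean expression change

# Hold out some perturbations, mimicking the "unseen perturbation" split [5]
train, test = slice(0, 100), slice(100, None)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(gene_embeddings[train], mean_delta_expression[train])
predicted_delta = rf.predict(gene_embeddings[test])
print(predicted_delta.shape)  # (20, 50): one predicted profile per held-out perturbation
```

The design choice worth noting is that the forest sees only the embedding of the perturbed gene, so any predictive signal must come from biological structure captured during pretraining.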
Zero-shot evaluation tests models' ability to perform tasks without any task-specific training, which is critical for exploratory research where labels are unknown. Recent benchmarking reveals important patterns in how scGPT and scFoundation perform in these settings.
Table 2: Zero-Shot Performance Across Critical Tasks (Relative Performance Ranking)
| Task | scGPT | scFoundation | Geneformer | Simple Baselines (HVG, etc.) |
|---|---|---|---|---|
| Cell Type Clustering | Intermediate | Limited data | Lowest performance | Highest performance |
| Batch Integration | Variable | Limited data | Lowest performance | Highest performance |
| Unseen Drug Prediction | Strong (F1: 0.858) | Not top performer | Not evaluated | Intermediate |
| Unseen Cell Line Prediction | State-of-the-art | Not top performer | Not evaluated | Lower performance |
In zero-shot cell type clustering and batch integration, both scGPT and Geneformer were consistently outperformed by simpler methods like Highly Variable Genes (HVG) selection and established integration tools like Harmony and scVI [16] [15]. However, scGPT demonstrated remarkable zero-shot capability in specific generalization tasks, achieving superior performance (F1 score: 0.858) in predicting responses to unseen drugs according to scDrugMap benchmarking [22] [11]. Additionally, scGPT-based approaches enabled "zero-shot generalization to unseen cell lines," representing a significant advancement for drug discovery applications [29].
Recent independent evaluations have established rigorous methodologies for assessing scFMs. The benchmarking protocol for perturbation prediction typically involves fine-tuning pre-trained models on specific datasets followed by held-out evaluation [5] [14]. For genetic perturbation prediction, models are trained on single-gene perturbations and evaluated on their ability to predict effects of double-gene perturbations or unseen single-gene perturbations [5] [14]. The key innovation in recent benchmarks is the inclusion of deliberately simple baselines like "mean prediction" and linear models, which provide reality checks on claimed capabilities [5] [14].
Zero-shot evaluation protocols differ significantly, as they exclude task-specific fine-tuning entirely [16] [15]. In these frameworks, models generate cell embeddings that are directly evaluated on tasks like cell type clustering and batch correction using metrics such as Average BIO score (AvgBio) for clustering accuracy and Principal Component Regression (PCR) for batch effect removal [16] [15]. The scDrugMap framework introduces both pooled-data evaluation (standard fine-tuning) and cross-data evaluation (assessing generalization to novel contexts) [22] [11].
Benchmarking studies employ specialized metrics tailored to each task. For perturbation prediction, the Pearson Delta metric—correlation between predicted and actual differential expression profiles—has emerged as particularly informative because it focuses on expression changes rather than absolute values, which are dominated by highly expressed genes [5]. Additional metrics include L2 distance for top differentially expressed genes and genetic interaction detection capability [14].
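The Pearson Delta computation can be expressed in a few lines. The exact preprocessing (log-normalization, gene filtering) varies between benchmarks, so this is a minimal illustration on made-up expression vectors:

```python
import numpy as np

def pearson_delta(pred_expr, true_expr, control_mean):
    """Pearson correlation computed in differential-expression space:
    predicted and observed profiles are both centred on the unperturbed
    control mean before correlating, so the score reflects expression
    *changes* rather than absolute levels dominated by highly expressed genes."""
    pred_delta = pred_expr - control_mean
    true_delta = true_expr - control_mean
    return np.corrcoef(pred_delta, true_delta)[0, 1]

# Toy example with made-up expression vectors
control = np.array([5.0, 1.0, 3.0, 0.5])
true_post = np.array([6.0, 0.5, 3.5, 0.4])
pred_post = np.array([5.8, 0.6, 3.4, 0.45])
print(round(pearson_delta(pred_post, true_post, control), 3))
```

A model that predicts absolute expression well but gets the direction of change wrong scores poorly here, which is exactly the failure mode this metric is designed to expose.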
For zero-shot evaluation, Average BIO score measures cell type clustering quality, while batch integration metrics quantify a model's ability to remove technical artifacts while preserving biological variation [16] [15]. In drug response prediction, F1 scores evaluate classification accuracy, particularly important for imbalanced datasets common in pharmaceutical applications [22] [11].
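An AvgBio-style score is commonly computed as the average of NMI, ARI, and a rescaled silhouette width over cell-type labels; the exact definition varies across studies, so the sketch below is an assumption about the general form rather than the benchmark's precise implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

def avg_bio(embeddings, cell_type_labels, n_clusters):
    """Sketch of an AvgBio-style score: average of NMI, ARI, and a
    rescaled silhouette width, following the convention used in
    zero-shot scFM benchmarks (exact definitions vary by study)."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=0).fit_predict(embeddings)
    nmi = normalized_mutual_info_score(cell_type_labels, clusters)
    ari = adjusted_rand_score(cell_type_labels, clusters)
    # Silhouette is computed on the ground-truth labels and rescaled to [0, 1]
    asw = (silhouette_score(embeddings, cell_type_labels) + 1) / 2
    return (nmi + ari + asw) / 3

# Two well-separated synthetic "cell types" should score near 1
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (50, 8)), rng.normal(3, 0.1, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)
print(round(avg_bio(emb, labels, n_clusters=2), 3))
```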
The most commonly used datasets in recent benchmarks include:
Table 3: Key Experimental Resources for Single-Cell Foundation Model Evaluation
| Resource Category | Specific Examples | Function in Evaluation |
|---|---|---|
| Perturbation Datasets | Norman et al. (2019), Adamson et al. (2016), Replogle et al. (2022) | Provide ground truth for evaluating perturbation prediction capabilities |
| Benchmarking Platforms | scDrugMap, GEARS, Custom benchmarking pipelines | Standardized evaluation frameworks for fair model comparison |
| Evaluation Metrics | Pearson Delta, AvgBio score, F1 score, PCR score | Quantify model performance across different task types |
| Baseline Models | Train Mean, Random Forest with GO features, HVG selection | Provide critical reference points for interpreting model performance |
| Integration Tools | Harmony, scVI, Seurat | Established methods for comparison on integration tasks |
Despite overall mixed performance across benchmarks, scGPT demonstrates notable strengths in specific generalization scenarios. The model excels particularly in cross-data evaluation and zero-shot drug response prediction, suggesting its pre-training on 33 million human cells has conferred meaningful biological understanding that transfers to novel contexts [22] [11] [29]. This capability is particularly valuable for drug discovery, where researchers need to predict compound effects on cell types or disease states not included in training data.
The architecture of scGPT, which uses a perturbation token added to the perturbed gene token to model perturbation effects, appears to provide a flexible framework for generalizing to novel conditions [5] [29]. Additionally, scGPT's strong performance when its embeddings are used in simpler models indicates that the pre-training process successfully captures biologically meaningful relationships, even if the full fine-tuning pipeline doesn't always leverage this knowledge optimally [5].
While scFoundation demonstrates strong performance in certain specialized tasks—particularly pooled-data evaluation where it achieved F1 scores of 0.971 in drug response prediction—it shows more limited generalization capability in cross-data and zero-shot settings [22] [11]. The model also faces practical limitations regarding gene set compatibility, as it "required each dataset to exactly match the genes from its own pretraining data," creating challenges for application to novel datasets [14].
scFoundation's architecture uses pretrained gene embeddings as inputs for graph neural network-based models such as GEARS for perturbation prediction [5]. While this approach shows promise, current benchmarks indicate it has not yet surpassed simpler alternatives in generalization tasks. However, it is worth noting that scFoundation excels in read-depth enhancement and in drug response prediction scenarios where the data matches its pre-training specifications [38].

Current benchmarking evidence presents a nuanced picture of single-cell foundation model capabilities. While both scGPT and scFoundation show promising performance in specific domains, neither consistently outperforms simpler baseline methods across diverse tasks [5] [16] [14]. However, scGPT demonstrates distinctive strengths in cross-data and zero-shot generalization, particularly for drug response prediction in unseen cell types and conditions [22] [11] [29].
These findings have important implications for researchers and drug development professionals. Model selection should be guided by specific application requirements: scFoundation may be preferable for tasks involving well-characterized cellular systems where data matches its pre-training specifications, while scGPT appears better suited for exploratory research requiring generalization to novel biological contexts [22] [11] [4]. The consistent underperformance of both models compared to simple baselines in certain tasks highlights the importance of rigorous benchmarking and the need for continued methodological development [5] [14].
Future research directions should focus on improving model efficiency, enhancing zero-shot capabilities, and developing more biologically meaningful pretraining objectives. As noted in recent benchmarks, "the goal of providing a generalizable representation of cellular states and predicting the outcome of not-yet-performed experiments is still elusive" [14], indicating substantial room for advancement in the field of single-cell foundation models.
Within the rapidly evolving field of single-cell biology, foundation models like scGPT and scFoundation promise a transformative understanding of cellular behavior by leveraging vast amounts of transcriptomics data. These models are designed to capture universal patterns in gene regulation, which can then be fine-tuned for specific downstream tasks, such as predicting gene expression changes following genetic perturbations. Concurrently, traditional machine learning methods like Random Forest (RF), regularized linear models such as Elastic-Net, and straightforward analytical techniques like selecting Highly Variable Genes (HVG) have long served as reliable benchmarks for performance. This guide provides an objective, data-driven comparison of these foundational and traditional approaches, synthesizing findings from recent rigorous benchmarking studies to inform researchers and drug development professionals about their relative strengths and limitations in practical applications.
Recent independent benchmark studies consistently reveal a significant finding: traditional methods, including simple baseline models, often meet or exceed the performance of sophisticated foundation models in critical tasks like perturbation prediction and cell type identification.
A benchmark study evaluating scGPT and scFoundation against baseline models on four Perturb-seq datasets provides quantitative evidence of their relative performance, measured by the Pearson correlation of predicted versus actual differential gene expression (Pearson Delta) [5].
Table 1: Benchmarking Performance on Post-Perturbation Prediction (Pearson Delta)
| Model / Dataset | Adamson | Norman | Replogle (K562) | Replogle (RPE1) |
|---|---|---|---|---|
| Train Mean (Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest (GO features) | 0.739 | 0.586 | 0.480 | 0.648 |
| Random Forest (scGPT embeddings) | 0.727 | 0.583 | 0.421 | 0.635 |
The data shows that the simple baseline of predicting the training set mean outperformed both foundation models across all datasets. Furthermore, a Random Forest (RF) model using Gene Ontology (GO) features as input "outperformed foundation models by a large margin" [5]. This superior performance was also consistent in a sub-analysis of combinatorial perturbations in the Norman dataset [5].
A separate study in Nature Methods confirmed these findings, noting that for predicting double perturbation effects, "all models had a prediction error substantially higher than the additive baseline," a simple model that sums individual logarithmic fold changes [14]. The study also developed a simple linear model that consistently matched or outperformed foundation models in predicting unseen single-gene perturbations [14].
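Both baselines discussed above are trivial to implement, which is precisely what makes their strong performance notable. A minimal sketch with synthetic log fold-change profiles:

```python
import numpy as np

def train_mean_baseline(train_deltas):
    """Predict the average differential-expression profile over all
    training perturbations, regardless of which gene is perturbed [5]."""
    return train_deltas.mean(axis=0)

def additive_baseline(delta_a, delta_b):
    """For a double perturbation A+B, predict the sum of the individual
    log fold changes measured for A and B alone [14]."""
    return delta_a + delta_b

# Toy log fold-change profiles for two single-gene perturbations
delta_a = np.array([0.5, -0.2, 0.0, 1.1])
delta_b = np.array([0.1, 0.3, -0.4, 0.2])
print(additive_baseline(delta_a, delta_b))  # element-wise sum
```

Neither baseline uses any information about which gene was perturbed (train mean) or any interaction term (additive), so a foundation model that fails to beat them has, by construction, not demonstrated learned regulatory knowledge.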
In tasks where models are used without fine-tuning (zero-shot), foundation models have shown limitations compared to established methods.
Table 2: Zero-Shot Performance on Cell Type Clustering (Average BIO Score)
| Method | PBMC (12k) | Tabula Sapiens | Pancreas | Immune |
|---|---|---|---|---|
| HVG | ~0.55 | ~0.72 | ~0.65 | ~0.70 |
| scVI | ~0.63 | ~0.70 | ~0.62 | ~0.64 |
| Harmony | ~0.57 | ~0.68 | ~0.60 | ~0.69 |
| scGPT | ~0.60 | ~0.65 | ~0.55 | ~0.63 |
| Geneformer | ~0.35 | ~0.40 | ~0.40 | ~0.35 |
In zero-shot cell type clustering, both HVG selection and models like scVI and Harmony consistently outperformed scGPT and Geneformer across multiple datasets, as measured by Average BIO score [16]. Geneformer's performance was particularly low. For batch integration, a task critical for combining datasets from different sources, "the best batch integration scores for all datasets were achieved by selecting HVG," with foundation models again underperforming relative to established baselines [16].
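HVG selection itself is a simple computation; `scanpy.pp.highly_variable_genes` is the standard implementation, but a minimal dispersion-based sketch conveys the idea behind this surprisingly strong baseline:

```python
import numpy as np

def select_hvg(expression, n_top):
    """Minimal dispersion-based HVG selection: rank genes by the
    variance-to-mean ratio of their expression and keep the top n_top.
    Production pipelines typically use scanpy.pp.highly_variable_genes,
    which adds mean-binned normalisation of the dispersions."""
    mean = expression.mean(axis=0)
    var = expression.var(axis=0)
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    return np.argsort(dispersion)[::-1][:n_top]

# Synthetic count matrix: gene 2 is made far more variable than the others
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(200, 5)).astype(float)
X[:100, 2] += 20.0  # bimodal across two "cell populations"
print(select_hvg(X, n_top=2))
```

Genes whose variance is bimodal across cell populations, like gene 2 here, are exactly the ones that drive clustering, which is why this preprocessing step alone can rival learned embeddings.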
To ensure the reproducibility and transparency of the comparisons cited, this section details the key experimental methodologies employed in the benchmark studies.
The evaluation of perturbation prediction models follows a structured process to ensure a fair and comparable assessment.
Title: Perturbation Prediction Benchmarking Workflow
1. Data Collection and Preprocessing:
2. Model Selection and Configuration:
3. Task Definition (Perturbation Exclusive - PEX):
4. Evaluation Metrics:
Evaluating models in a zero-shot setting tests their inherent biological understanding without task-specific fine-tuning.
Title: Zero-Shot Evaluation Workflow
1. Model Preparation:
2. Embedding Generation:
3. Downstream Task Execution:
4. Evaluation Metrics:
This section details essential computational tools and data resources central to conducting benchmarking studies in single-cell genomics.
Table 3: Essential Resources for Single-Cell Benchmarking Studies
| Category | Item / Software | Function in Research | Key Considerations |
|---|---|---|---|
| Benchmark Datasets | Perturb-seq (Adamson, Norman, Replogle) | Provides causal perturbation→expression data for training/evaluating predictive models. | Check for low perturbation-specific variance, which can complicate evaluation [5]. |
| | CITE-seq Datasets (e.g., from SPDB) | Provides paired transcriptomic and proteomic data from the same cells for cross-modal method testing [40]. | Enables benchmarking on consistent biological conditions across omics. |
| Foundation Models | scGPT | A transformer-based foundation model for single-cell data; used for prediction and generating gene/cell embeddings [5] [16]. | Requires fine-tuning for specific tasks; zero-shot performance may be limited [16]. |
| | Geneformer | Another transformer-based foundation model pre-trained on single-cell data [16]. | Like scGPT, its zero-shot performance can be inconsistent [16]. |
| Traditional ML Models | Random Forest (scikit-learn) | An ensemble tree-based model used for regression and classification; often serves as a strong, interpretable baseline. | Can leverage biological features (GO terms) and often outperforms more complex models [5]. |
| | Elastic-Net (GLMNET) | A linear model combining L1 and L2 regularization; effective for feature selection and handling correlated variables [41]. | Useful for biomarker identification and building parsimonious models. |
| Analysis & Evaluation | HVG Selection | A standard preprocessing step to select genes with high cell-to-cell variation for downstream analysis such as clustering. | A simple yet highly effective baseline for tasks like clustering and batch integration [16]. |
| | scVI / Harmony | Tools for single-cell data analysis, specializing in probabilistic modeling (scVI) and batch correction (Harmony) [16]. | Often outperform foundation models in tasks like batch integration and cell type clustering [16]. |
Single-cell foundation models (scFMs) represent a transformative advance in computational biology, leveraging large-scale deep learning to interpret complex single-cell data. Trained on millions of single-cell transcriptomes through self-supervised learning, these models develop foundational knowledge of cellular biology that can be adapted to various downstream tasks [1]. Among the leading scFMs, scGPT and scFoundation have emerged as prominent yet specialized models, each demonstrating distinct strengths across different application domains. This guide provides a comprehensive, evidence-based comparison of their capabilities, with particular focus on scGPT's proficiency in multi-omics integration versus scFoundation's performance in drug response prediction, drawing upon recent benchmarking studies to inform researchers and drug development professionals.
The specialized capabilities of scGPT and scFoundation stem from their distinct architectural designs and pretraining methodologies, which shape their effectiveness for different biological tasks.
scGPT utilizes a transformer architecture inspired by the Generative Pretrained Transformer (GPT) family, employing a unidirectional masked self-attention mechanism that iteratively predicts masked genes conditioned on known genes [1]. This design excels at generative tasks and multi-modal integration. The model comprises approximately 50 million parameters and was pretrained on around 33 million non-cancerous human cells [4]. A key strength of scGPT lies in its flexible input representation, which can incorporate diverse omics modalities—including scRNA-seq, scATAC-seq, CITE-seq, and spatial transcriptomics—through specialized tokenization strategies that bin expression values and use modality-specific tokens [1] [4].
scFoundation employs an asymmetric encoder-decoder architecture with approximately 100 million parameters, pretrained on roughly 50 million cells [4]. Unlike scGPT, it processes a comprehensive set of 19,264 human protein-encoding genes alongside common mitochondrial genes [4]. Its pretraining utilizes a read-depth-aware masked gene modeling objective with mean squared error loss, focusing on reconstructing gene expression values [4]. This design prioritizes capturing deep relationships within transcriptomics data rather than cross-modal integration.
| Model | Input Gene Strategy | Value Embedding | Positional Embedding | Multi-omics Support |
|---|---|---|---|---|
| scGPT | 1,200 highly variable genes (HVGs) | Value binning | Not used | Native support for multiple modalities |
| scFoundation | All 19,264 protein-encoding genes | Value projection | Not used | Primarily scRNA-seq focused |
Table 1: Input representation strategies for scGPT and scFoundation
The models differ significantly in their tokenization approaches. scGPT uses highly variable genes and employs value binning to transform continuous expression values into discrete tokens, facilitating its transformer-based processing [4]. In contrast, scFoundation utilizes the complete set of protein-encoding genes with value projection, preserving more comprehensive genomic information but requiring more computational resources [4]. These fundamental differences in architecture and input representation establish the foundation for their divergent performance across specialized tasks.
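The two value-encoding strategies can be contrasted in a few lines. The sketch below is illustrative rather than the models' actual code; bin counts and projection dimensions are arbitrary choices here:

```python
import numpy as np

def value_binning(expression, n_bins=51):
    """scGPT-style discretisation (sketch): map each cell's nonzero
    expression values onto equal-frequency bins so a transformer can
    treat them as discrete tokens; zeros keep a dedicated bin 0 [4]."""
    binned = np.zeros_like(expression, dtype=int)
    nonzero = expression > 0
    if nonzero.any():
        edges = np.quantile(expression[nonzero], np.linspace(0, 1, n_bins))
        binned[nonzero] = np.digitize(expression[nonzero], edges[1:-1]) + 1
    return binned

def value_projection(expression, weight):
    """scFoundation-style continuous embedding (sketch): project each
    scalar expression value into the model dimension with a learned
    linear map, preserving continuous magnitude information [4]."""
    return expression[:, None] * weight[None, :]

cell = np.array([0.0, 0.5, 2.3, 0.0, 7.1])
print(value_binning(cell, n_bins=4))  # [0 1 2 0 3]
w = np.ones(8) * 0.1                  # stand-in for a learned projection vector
print(value_projection(cell, w).shape)  # (5, 8)
```

The trade-off is visible in the toy output: binning collapses 0.5 and 2.3 into adjacent discrete tokens (robust to noise, lossy in magnitude), while projection keeps the exact values (information-preserving, but sensitive to read depth).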
scGPT demonstrates superior capabilities in integrating diverse molecular modalities, a critical requirement for comprehensive cellular analysis. Benchmarking studies consistently highlight scGPT's architectural advantages for multi-omics tasks. According to comprehensive evaluations, scGPT's design natively supports "scRNA-seq, scATAC-seq, CITE-seq, and spatial transcriptomics" through its flexible tokenization system [1] [4]. This enables researchers to jointly analyze gene expression, chromatin accessibility, and protein abundance within a unified representation space.
The BioLLM benchmarking framework, which provides standardized evaluation of multiple scFMs, identified scGPT as exhibiting "robust performance across all tasks," with particular strength in multi-omics integration scenarios [17]. This cross-modal capability stems from scGPT's use of modality-specific tokens and its value binning approach, which creates a standardized representation scheme across different data types [1]. When processing multi-omics data, scGPT can effectively leverage relationships between different molecular layers, enabling more holistic cellular state characterization.
scFoundation shows specialized strength in drug response prediction, particularly in contexts with sufficient training data. The scDrugMap benchmarking study, which evaluated eight single-cell foundation models across 326,751 cells from 36 datasets, found that scFoundation "outperformed all others" in pooled-data evaluation for drug response prediction [21]. Specifically, scFoundation achieved the highest mean F1 scores of 0.971 and 0.947 using layer-freezing and fine-tuning strategies respectively, outperforming the lowest-performing model by 54% and 57% [21].
However, model performance varies significantly based on evaluation scenarios. In cross-data evaluation, where models are tested on completely independent datasets, UCE achieved the highest performance (mean F1 score: 0.774) after fine-tuning on tumor tissue, while scGPT demonstrated superior performance (mean F1 score: 0.858) in zero-shot learning settings [21]. This indicates that while scFoundation excels with adequate training data, scGPT may offer better generalization to novel drug compounds or cellular contexts without task-specific fine-tuning.
| Task | Best Performing Model | Key Metric | Performance Context |
|---|---|---|---|
| Multi-omics Integration | scGPT | Qualitative assessment | Native architectural support |
| Drug Response (Pooled-data) | scFoundation | F1 Score | 0.971 (layer-freezing) |
| Drug Response (Zero-shot) | scGPT | F1 Score | 0.858 (cross-data) |
| Post-Perturbation Prediction | Simple Baselines | Pearson Delta | Outperforms both models |
| Batch Integration | scGPT (on complex batches) | Batch mixing scores | Outperforms scFoundation |
Table 2: Task-specific performance comparison between scGPT and scFoundation
Recent independent benchmarking reveals important limitations for both models in certain applications. A critical evaluation published in Nature Methods found that for predicting transcriptome changes after genetic perturbations, "none outperformed the baselines," including deliberately simple additive and no-change models [14]. Similarly, a study in BMC Genomics reported that even the simplest baseline model—taking the mean of training examples—outperformed both scGPT and scFoundation for post-perturbation gene expression prediction [5].
Figure 1: scGPT Multi-omics Integration Workflow
The multi-omics integration protocol using scGPT follows a standardized workflow (Figure 1). First, data from different modalities (scRNA-seq, scATAC-seq, spatial transcriptomics, CITE-seq) undergo modality-specific tokenization, where each modality is assigned special token identifiers [1] [4]. Expression or accessibility values are then processed through value binning, which discretizes continuous measurements into predefined ranges [4]. The tokenized sequences are concatenated into a unified input sequence; because gene identity is carried by the tokens themselves, scGPT does not use traditional positional embeddings [4].
The model processes this integrated sequence through its transformer architecture, employing masked self-attention to capture cross-modal relationships. During fine-tuning for specific integration tasks, the cell embedding (a specialized [CLS] token) is typically extracted as the integrated representation [1]. For evaluation, researchers commonly assess clustering purity, batch integration metrics, and the preservation of biological variance using standardized benchmarks like the AIDA v2 dataset [4].
Figure 2: Drug Response Prediction Experimental Workflow
The drug response prediction protocol follows rigorous benchmarking standards established by scDrugMap [21]. As shown in Figure 2, the process begins with curating single-cell expression matrices from drug-treated samples, with balanced representation of sensitive and resistant cells. The scFoundation model serves as a feature extractor, generating latent representations of each cell's transcriptional state [21].
Two evaluation scenarios are implemented: pooled-data evaluation (models trained and tested on aggregated data from multiple studies) and cross-data evaluation (models tested on completely independent datasets) [21]. For model adaptation, researchers employ either layer freezing (using scFoundation as a fixed feature extractor) or fine-tuning with Low-Rank Adaptation (LoRA), which updates a small subset of parameters [21]. Performance is assessed using F1 scores, AUROC, and accuracy, with particular emphasis on generalizability across different tissue types, cancer types, and treatment regimens [21].
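The LoRA technique referenced here (Hu et al., 2021) can be illustrated independently of any particular foundation model: the pretrained weight matrix is frozen, and only a low-rank update is trained. This numpy sketch is a conceptual illustration, not scDrugMap's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4

# Frozen pretrained weight (stand-in for one attention/projection matrix)
W = rng.normal(size=(d_out, d_in))

# LoRA adds a trainable low-rank update: W' = W + (alpha / r) * B @ A.
# B starts at zero so the adapted model is identical to the pretrained
# one before any fine-tuning steps (Hu et al., 2021).
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))
alpha = 8.0

def adapted_forward(x):
    return x @ (W + (alpha / rank) * B @ A).T

x = rng.normal(size=(2, d_in))
# Before training, the LoRA output equals the frozen-model output
print(np.allclose(adapted_forward(x), x @ W.T))  # True

# Trainable parameters: only A and B — a small fraction of the full matrix
print(A.size + B.size, "vs", W.size)  # 512 vs 4096
```

At rank 4, the trainable parameter count is an eighth of the full matrix here; for real transformer layers the ratio is far smaller, which is what makes LoRA attractive for adapting 100M-parameter models on modest drug response datasets.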
| Resource Type | Specific Examples | Function in Analysis |
|---|---|---|
| Data Resources | CZ CELLxGENE [1], Human Cell Atlas [1], GEO/SRA [1] | Provide standardized single-cell datasets for pretraining and benchmarking |
| Benchmarking Platforms | scDrugMap [21], BioLLM [17] | Offer standardized evaluation frameworks and metrics |
| Computational Tools | Low-Rank Adaptation (LoRA) [21], Layer Freezing [21] | Enable efficient model fine-tuning with limited data |
| Evaluation Metrics | F1 Score [21], Pearson Delta [5], Batch Integration Scores [16] | Quantify model performance across different tasks |
| Perturbation Datasets | Perturb-seq [5] [14], Norman et al. [14], Replogle et al. [14] | Provide ground truth for evaluating perturbation prediction |
Table 3: Essential Research Resources for scFM Evaluation
Successful application of scGPT and scFoundation requires access to several key resources. Public data repositories like CZ CELLxGENE (containing over 100 million unique cells) and the Human Cell Atlas provide essential pretraining corpora and standardized datasets [1]. For drug response prediction, the scDrugMap resource offers curated collections of 326,751 primary cells and 18,856 validation cells with drug response annotations [21].
Computationally, Low-Rank Adaptation (LoRA) has emerged as a critical technique for efficient fine-tuning of both models, significantly reducing computational requirements while maintaining performance [21]. For rigorous evaluation, established perturbation datasets (Adamson, Norman, Replogle) serve as standard benchmarks, though recent studies caution about their limitations in capturing perturbation-specific variance [5] [14].
Based on comprehensive benchmarking evidence, scGPT represents the superior choice for researchers requiring flexible multi-omics integration and generalization to novel biological contexts without extensive fine-tuning. Its architectural advantages in handling diverse data modalities and strong zero-shot performance make it particularly valuable for exploratory research where labeled data is scarce or unavailable.
Conversely, scFoundation demonstrates specialized excellence in drug response prediction when sufficient training data is available, particularly in pooled-data scenarios where its comprehensive gene coverage and architectural optimization for transcriptomics data yield state-of-the-art performance. However, researchers should note that simple baseline models can sometimes outperform both scGPT and scFoundation for specific tasks like perturbation prediction [5] [14], highlighting the importance of task-specific evaluation before committing to computationally intensive approaches.
For optimal model selection, researchers should consider their specific data characteristics (modality, sample size), application context (known vs. novel perturbations), and computational resources. As the scFM field rapidly evolves, frameworks like BioLLM [17] and scDrugMap [21] provide essential standardized platforms for ongoing evaluation of these powerful but specialized tools in biological and clinical research.
In the rapidly evolving field of single-cell biology, foundation models like scGPT and scFoundation promise to revolutionize how we analyze cellular systems by learning universal patterns from massive datasets. However, their true capability and performance relative to each other and to simpler methods must be rigorously assessed using standardized evaluation metrics and frameworks. This comparison guide objectively examines the performance of scGPT and scFoundation within a broader benchmarking context, focusing on critical metrics such as Pearson correlation for perturbation prediction and integration metrics for data harmonization. Drawing on recent experimental evidence, we summarize quantitative data and detail methodological protocols to provide researchers, scientists, and drug development professionals with a clear, evidence-based resource for model selection.
Independent benchmarking studies have consistently evaluated scGPT and scFoundation against various baseline models across multiple tasks and datasets. The tables below summarize key quantitative findings from these rigorous comparisons.
Table 1: Benchmarking performance of foundation models versus baselines on post-perturbation RNA-seq prediction (Pearson Delta metric)
| Model | Adamson Dataset | Norman Dataset | Replogle K562 | Replogle RPE1 |
|---|---|---|---|---|
| Train Mean (Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest with GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
| Random Forest with scGPT Embeddings | 0.727 | 0.583 | 0.421 | 0.635 |
Source: Adapted from [5]
Table 2: Zero-shot performance comparison on cell type clustering (Average BIO Score)
| Model | Pancreas Dataset | Tabula Sapiens | Immune Dataset | PBMC (12k) |
|---|---|---|---|---|
| HVG (Baseline) | 0.614 | 0.582 | 0.601 | 0.592 |
| scVI | 0.598 | 0.565 | 0.578 | 0.561 |
| Harmony | 0.587 | 0.554 | 0.572 | 0.550 |
| scGPT | 0.532 | 0.521 | 0.525 | 0.581 |
| Geneformer | 0.448 | 0.432 | 0.441 | 0.445 |
Source: Adapted from [16]
Table 3: Key model architecture and training specifications
| Specification | scGPT | scFoundation |
|---|---|---|
| Parameters | 53 million [42] | 100 million [5] |
| Pretraining Dataset Size | 33 million cells [10] [42] | 50 million cells [5] |
| Architecture | Transformer [42] | Transformer [4] |
| Gene Embedding Strategy | Value binning [42] | Value projection [12] |
| Primary Pretraining Task | Masked gene modeling [42] | Read-depth-aware masked gene modeling [4] |
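The two gene embedding strategies in Table 3 can be contrasted with a simplified sketch. These helper functions are illustrative stand-ins, not the models' actual implementations: scGPT-style binning discretizes expression values into tokens, while scFoundation-style projection maps each scalar value into a continuous embedding.

```python
import numpy as np

def bin_expression(values, n_bins=10):
    """scGPT-style value binning (simplified): nonzero expression values are
    mapped to quantile bins within the cell; zeros keep a dedicated token 0."""
    tokens = np.zeros(values.shape, dtype=int)
    nz = values > 0
    if nz.any():
        edges = np.quantile(values[nz], np.linspace(0, 1, n_bins + 1))
        tokens[nz] = np.digitize(values[nz], edges[1:-1]) + 1  # tokens 1..n_bins
    return tokens

def project_expression(values, weight, bias):
    """scFoundation-style value projection (simplified): each scalar
    expression value is mapped to a continuous embedding vector."""
    return np.outer(values, weight) + bias  # shape: (genes, embed_dim)

expr = np.array([0.0, 0.5, 1.2, 3.4, 0.0, 2.2])
tokens = bin_expression(expr, n_bins=3)                   # discrete tokens per gene
emb = project_expression(expr, np.ones(4), np.zeros(4))   # (6, 4) continuous embeddings
```

The design trade-off: binning makes inputs robust to scale and read-depth differences but loses resolution within a bin, whereas projection preserves continuous magnitude information at the cost of sensitivity to normalization.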
The evaluation of perturbation prediction capabilities follows a standardized protocol designed to assess model generalizability to unseen perturbations (the Perturbation Exclusive, or PEX, setup) [5]. The core methodology involves:
1. **Data Preparation:** Using Perturb-seq datasets (e.g., Adamson, Norman, Replogle) generated via CRISPR-based perturbations (CRISPRi/CRISPRa) combined with single-cell sequencing. Data is partitioned so that specific perturbations are held out from the training set for evaluation [5] [14].
2. **Model Input Formulation:** Each model receives control expression profiles together with an encoding of the perturbed gene(s), following its native input format [5].
3. **Fine-tuning Protocol:** Both foundation models are fine-tuned on the benchmark datasets according to their original publications' specifications before evaluation [5].
4. **Evaluation Metrics:** Performance is reported as the Pearson Delta metric, the Pearson correlation between predicted and observed changes in gene expression relative to unperturbed control cells [5].
5. **Baseline Models:** Comparisons include the Train Mean baseline (predicting the mean post-perturbation profile of the training set) and Random Forest regressors equipped with either Gene Ontology features or scGPT gene embeddings [5].
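The defining feature of the PEX setup is that whole perturbations, not individual cells, are held out. A minimal sketch of such a split, with a hypothetical helper name and toy perturbation labels:

```python
import numpy as np

def perturbation_exclusive_split(perturbation_labels, holdout_frac=0.2, seed=0):
    """PEX-style split (sketch): hold out entire perturbations, not cells,
    so evaluated perturbations are never seen during fine-tuning."""
    rng = np.random.default_rng(seed)
    perts = np.unique(perturbation_labels)
    rng.shuffle(perts)
    n_test = max(1, int(len(perts) * holdout_frac))
    test_perts = set(perts[:n_test])
    test_mask = np.isin(perturbation_labels, list(test_perts))
    return ~test_mask, test_mask

labels = np.array(["KLF1", "KLF1", "GATA1", "GATA1", "TP53", "TP53", "MYC", "MYC"])
train_mask, test_mask = perturbation_exclusive_split(labels, holdout_frac=0.25)
# No perturbation appears on both sides of the split:
assert not (set(labels[train_mask]) & set(labels[test_mask]))
```

A cell-level random split would leak information, since other cells carrying the same perturbation would appear in training; the perturbation-level split is what makes the benchmark a test of generalization.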
Figure 1: Workflow for perturbation prediction benchmarking
The assessment of zero-shot capabilities focuses on model performance without task-specific fine-tuning, which is critical for exploratory biological applications where labels are unknown [16]. The methodology includes:
1. **Embedding Extraction:** Generating cell embeddings from pre-trained foundation models without additional fine-tuning [16].
2. **Cell Type Clustering Task:** Clustering the frozen embeddings and scoring their agreement with annotated cell types, summarized as an Average BIO Score (Table 2) [16].
3. **Batch Integration Task:** Assessing how well embeddings mix cells from different batches while preserving cell-type structure, compared against established baselines such as Harmony and scVI [16].
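The clustering evaluation can be sketched end to end: cluster frozen embeddings, then score agreement with known cell-type labels. The Average BIO Score in Table 2 aggregates several bio-conservation metrics; here, as an illustrative stand-in, we average just ARI and NMI on synthetic embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def zero_shot_bio_score(embeddings, cell_type_labels, n_clusters):
    """Zero-shot clustering evaluation (sketch): cluster frozen model
    embeddings, then score agreement with annotated cell types. Averaging
    ARI and NMI stands in for the benchmark's Average BIO Score."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=0).fit_predict(embeddings)
    ari = adjusted_rand_score(cell_type_labels, clusters)
    nmi = normalized_mutual_info_score(cell_type_labels, clusters)
    return (ari + nmi) / 2.0

# Synthetic embeddings with three well-separated cell types
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 16)) for c in (0.0, 3.0, 6.0)])
cell_types = np.repeat(["T cell", "B cell", "NK cell"], 50)
score = zero_shot_bio_score(emb, cell_types, n_clusters=3)
```

On cleanly separated synthetic clusters the score approaches 1.0; the interest of Table 2 is precisely that real foundation-model embeddings fall well short of that on biological data.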
The biological relevance of foundation models can be assessed by examining how well their internal representations align with established biological knowledge, particularly in the context of gene regulatory networks and signaling pathways.
Gene Embedding Analysis: Studies have compared the similarity of gene embeddings from foundation models against known biological relationships, including shared biological pathways (KEGG, REACTOME) and gene regulatory networks (CollecTRI) [5]. Random Forest models using biological prior knowledge (Gene Ontology vectors) consistently outperform foundation models, suggesting limitations in the biological meaningfulness of the learned representations [5].
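One simple way to probe whether embedding similarity reflects known biology, in the spirit of the analysis above, is to compare cosine similarities of gene pairs that share a pathway against pairs that do not. The function name, toy embeddings, and pathway labels below are all illustrative assumptions, not the benchmark's actual procedure.

```python
import numpy as np

def embedding_vs_pathway_agreement(gene_embeddings, pathway_membership):
    """Sketch: mean cosine similarity of gene pairs that share at least one
    annotated pathway versus pairs that share none."""
    norm = gene_embeddings / np.linalg.norm(gene_embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T
    same, diff = [], []
    n = len(pathway_membership)
    for i in range(n):
        for j in range(i + 1, n):
            shared = bool(set(pathway_membership[i]) & set(pathway_membership[j]))
            (same if shared else diff).append(sim[i, j])
    return float(np.mean(same)), float(np.mean(diff))

rng = np.random.default_rng(0)
gene_emb = rng.normal(size=(4, 8))               # toy embeddings for 4 genes
pathways = [{"KEGG_A"}, {"KEGG_A"}, {"KEGG_B"}, {"KEGG_B"}]  # hypothetical annotations
shared_sim, other_sim = embedding_vs_pathway_agreement(gene_emb, pathways)
```

If learned representations encode pathway structure, the shared-pathway similarity should be systematically higher; the benchmark's finding is that this signal is weaker in foundation-model embeddings than prior-knowledge baselines would suggest.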
Genetic Interaction Prediction: Models are evaluated on their ability to predict non-additive genetic interactions, categorized as "buffering," "synergistic," or "opposite" effects [14]. Current foundation models predominantly predict buffering interactions and rarely correctly identify synergistic interactions [14].
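The interaction categories above can be made concrete with a toy classifier. The thresholding scheme here is an assumed simplification for a single summary gene score, not the benchmark's exact definition: the observed double-perturbation effect is compared against the additive expectation of the two single perturbations.

```python
import numpy as np

def classify_interaction(delta_a, delta_b, delta_ab, tol=0.1):
    """Label a genetic interaction (assumed scheme) by comparing the observed
    double-perturbation effect delta_ab to the additive expectation
    delta_a + delta_b, with a tolerance band for 'additive'."""
    expected = delta_a + delta_b
    if np.sign(delta_ab) != np.sign(expected) and abs(delta_ab) > tol:
        return "opposite"      # combined effect flips direction
    if abs(delta_ab) > abs(expected) + tol:
        return "synergistic"   # stronger than additive
    if abs(delta_ab) < abs(expected) - tol:
        return "buffering"     # weaker than additive
    return "additive"

print(classify_interaction(0.5, 0.4, 0.3))   # buffering
print(classify_interaction(0.5, 0.4, 1.5))   # synergistic
print(classify_interaction(0.5, 0.4, -0.6))  # opposite
```

Because buffering is by far the most common outcome in Perturb-seq data, a model can score reasonably on aggregate metrics while almost never identifying the rarer synergistic class, which is the failure mode reported for current foundation models [14].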
Figure 2: Comprehensive evaluation framework for foundation models
Table 4: Essential research reagents and computational resources for foundation model benchmarking
| Resource | Type | Function in Benchmarking |
|---|---|---|
| Perturb-seq Datasets (Adamson, Norman, Replogle) | Biological Data | Provide ground truth measurements of post-perturbation gene expression profiles for model training and evaluation [5] [14]. |
| Gene Ontology (GO) Vectors | Biological Knowledge Base | Serve as biologically meaningful features for baseline machine learning models [5]. |
| CRISPR Interference (CRISPRi) | Molecular Tool | Enables precise genetic perturbations in experimental datasets used for benchmarking [5]. |
| Hubble / Harmony | Computational Method | Serve as established baselines for batch integration tasks in zero-shot evaluations [16]. |
| scVI | Computational Method | Generative model used as a baseline for batch correction and data integration [16]. |
| Random Forest Regressor | Machine Learning Algorithm | Provides a simple yet strong baseline when equipped with biological features [5]. |
| Highly Variable Genes (HVG) | Feature Selection Method | Standard approach for selecting informative genes, used as a competitive baseline [16]. |
The comprehensive benchmarking of scGPT and scFoundation reveals a nuanced performance landscape. For perturbation prediction, both foundation models are consistently outperformed by simpler baseline approaches, with the Train Mean baseline exceeding scGPT and scFoundation across all four benchmark datasets, and Random Forest models using biological features achieving superior Pearson Delta metrics [5]. In zero-shot evaluation for cell type annotation and batch integration, both models demonstrate limitations, with scGPT showing variable performance across datasets and Geneformer consistently underperforming established methods [16].
These findings highlight critical considerations for researchers and drug development professionals: foundation models present promising frameworks but have not yet consistently surpassed simpler, more interpretable methods in key tasks. Model selection should therefore be guided by specific task requirements, dataset characteristics, and available computational resources, rather than assuming superior performance from more complex foundation architectures. Future development should focus on improving the biological meaningfulness of learned representations and enhancing zero-shot capabilities for truly exploratory biological discovery.
Taken together, these studies show that neither model universally dominates. scFoundation demonstrates superior performance in specific, well-defined tasks such as drug response prediction, while scGPT shows stronger generalization in cross-dataset and zero-shot settings. Critical limitations include inconsistent zero-shot performance, vulnerability to batch effects, and surprising underperformance against simpler models that exploit biological prior knowledge. These findings underscore that model selection must be task-specific, and they highlight the need for more challenging benchmark datasets and standardized evaluation frameworks. Future development should focus on pretraining objectives that improve zero-shot generalization and on hybrid approaches that combine the strengths of foundation models with established biological knowledge. For biomedical research, the strategic integration of these models holds significant potential to accelerate drug discovery and personalized medicine, provided their current limitations are acknowledged and addressed.