Single-cell foundation models (scFMs) promise to predict transcriptomic responses to genetic perturbations, offering a powerful in silico tool for drug target discovery and functional genomics. This article synthesizes recent rigorous benchmarking studies that reveal a critical gap: current scFMs often fail to outperform simple linear baselines, struggling to generalize beyond systematic biases in training data. We explore the methodological underpinnings of leading models such as scGPT, Geneformer, and GEARS; the emerging 'closed-loop' fine-tuning paradigm that incorporates experimental data to enhance accuracy; and a new evaluation framework, Systema, designed for biologically meaningful assessment. For researchers and drug development professionals, this review provides a crucial guide to the current capabilities, limitations, and optimal application of perturbation prediction models, highlighting a pivotal moment of recalibration and the field's future potential.
The quest to create a faithful in silico model of a cell—a "virtual cell"—has long been a goal of computational biology. Such a model promises to revolutionize drug discovery by enabling researchers to simulate and predict the effects of genetic and chemical perturbations safely and economically, thereby accelerating target identification [1]. A core test for these models is the accurate prediction of transcriptional responses to genetic perturbations, a task that single-cell foundation models (scFMs), inspired by large language models, were expected to excel at [2].
However, recent rigorous benchmarking studies have revealed a significant performance gap. This guide provides an objective comparison of the current state of perturbation effect prediction, focusing on the empirical evaluation of scFMs against simpler baseline models. The findings underscore a critical moment in the field: the need for more reliable evaluation metrics and specialized models to realize the full potential of virtual cells for target discovery [3] [2] [4].
Independent benchmarks have consistently shown that current deep-learning-based scFMs do not outperform deliberately simple baseline models in predicting perturbation effects [3] [2].
Table 1: Summary of Model Performance on Perturbation Prediction Tasks
| Model Category | Representative Models | Performance on Double Perturbation Prediction | Performance on Unseen Perturbation Prediction | Key Limitations |
|---|---|---|---|---|
| Single-Cell Foundation Models (scFMs) | scGPT, scFoundation, scBERT, Geneformer, UCE [2] | Underperformed the additive baseline; higher prediction error (L2 distance) [2] | Did not consistently outperform the "mean prediction" or simple linear baseline [2] | Struggles with strong/atypical effects and distribution shifts; high computational cost [3] [2] |
| Other Deep Learning Models | GEARS, CPA [2] | Underperformed the additive baseline [2] | GEARS did not consistently outperform baselines [2] | Predictions vary less than ground truth [2] |
| Simple Baseline Models | "No Change", "Additive", Linear Model [2] | "Additive" baseline had the lowest prediction error [2] | Simple linear model and "mean prediction" were highly competitive or superior [2] | Incapable of representing complex biological interactions [2] |
A study published in Nature Methods (2025) directly compared five foundation models and two other deep learning models against simple baselines for predicting transcriptome changes after single or double gene perturbations. The study concluded that "none outperformed the baselines" [2]. Similarly, the PertEval-scFM benchmarking framework found that zero-shot scFM embeddings "do not provide consistent improvements over baseline models, especially under distribution shift" and that all benchmarked models struggled with predicting strong or atypical perturbation effects [3].
Double Perturbation Prediction: In a benchmark using data from Norman et al. where 124 pairs of genes were perturbed, all deep learning models had a substantially higher prediction error (L2 distance) than the "additive" baseline, which simply sums the individual logarithmic fold changes of single perturbations [2].
Genetic Interaction Prediction: When tasked with predicting synergistic or buffering genetic interactions, none of the deep learning models performed better than the "no change" baseline, which always predicts the control condition [2].
Unseen Perturbation Prediction: For predicting the effects of entirely new perturbations, a simple linear model (or even just predicting the mean of the training data) was not consistently outperformed by any of the deep learning models, including those designed for this task [2].
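The baselines referenced above are deliberately trivial. A minimal numpy sketch of all three (function and variable names here are illustrative, not from the benchmark code) shows how little machinery they require:

```python
import numpy as np

def no_change_baseline(control_expr):
    """Predict the control condition's expression unchanged."""
    return control_expr.copy()

def additive_baseline(control_expr, pert_a_expr, pert_b_expr):
    """Sum the individual log fold changes of two single perturbations
    to predict the double perturbation (expression assumed log-scale)."""
    lfc_a = pert_a_expr - control_expr
    lfc_b = pert_b_expr - control_expr
    return control_expr + lfc_a + lfc_b

def mean_prediction_baseline(training_pert_exprs):
    """Predict the mean expression across all training perturbations."""
    return np.mean(training_pert_exprs, axis=0)

# Toy example: 4 genes, log-scale expression
control = np.array([1.0, 2.0, 0.5, 3.0])
pert_a  = np.array([1.5, 2.0, 0.5, 2.0])   # A raises gene 0, lowers gene 3
pert_b  = np.array([1.0, 3.0, 0.5, 3.0])   # B raises gene 1
pred_double = additive_baseline(control, pert_a, pert_b)
```

That a handful of arithmetic operations like these outperforms fine-tuned transformers is precisely what makes the benchmark results so striking.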
The following diagram illustrates the standard workflow for benchmarking perturbation prediction models, as used in recent critical studies [2].
The protocols below are synthesized from the methodologies of PertEval-scFM and the Nature Methods benchmark [3] [2].
1. Data Sourcing and Preprocessing: obtain public Perturb-seq datasets (e.g., Norman et al., Replogle et al., Adamson et al.), normalize and log-transform expression, and partition perturbations into training and held-out test sets [2].
2. Model Training and Fine-tuning: fine-tune each scFM or deep learning model on the training split using its recommended settings, and fit the simple baselines on the same data [2].
3. Prediction and Evaluation: generate predictions for the held-out perturbations and score them against observed expression (e.g., L2 distance, Pearson delta), repeating across random partitions for robustness [2].
Building and evaluating virtual cell models requires specific types of data and computational resources. The following table lists key "research reagents" for this field.
Table 2: Essential Research Reagents and Data for Virtual Cell Development
| Item Name | Type | Function in Virtual Cell Research | Example Sources/Formats |
|---|---|---|---|
| A Priori Knowledge | Data Pillar | Encapsulates fundamental biological mechanisms from existing literature; foundation for model construction [5]. | Text-based literature, molecular databases (e.g., Gene Ontology [2]) |
| Static Architecture Data | Data Pillar | Provides a snapshot of cellular structures; essential for defining the model's spatial and morphological context [5]. | Cryo-EM, super-resolution imaging, spatial omics data |
| Dynamic States Data | Data Pillar | Captures cellular changes over time or after perturbation; critical for training predictive models [5]. | Perturb-seq, perturbation proteomics, time-series omics data [5] |
| Benchmarking Datasets | Data Resource | Standardized datasets used to evaluate and compare model performance objectively [3] [2]. | Norman et al., Replogle et al., Adamson et al. datasets [2] |
| Linear Baseline Models | Computational Tool | Simple, interpretable models that serve as a critical baseline for evaluating complex scFMs [2]. | "No change", "Additive", Linear regression models [2] |
While the benchmarks above suggest limited performance for scFMs, it is critical to consider the tools used for evaluation. Recent research from Shift Bioscience indicates that concerns about model reliability may be partly due to metric miscalibration [4].
Their study argues that common evaluation metrics often struggle to distinguish robust predictions from uninformative ones, particularly for weaker genetic perturbations. When using a newly calibrated framework involving rank-based and Differentially Expressed Gene (DEG)-aware metrics, virtual cell models demonstrated clear and consistent improvements over traditional uninformative baselines [4]. This highlights that the choice and calibration of evaluation metrics are as important as the model architecture itself.
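The exact rank-based and DEG-aware metrics used in that study are not reproduced here; as one illustrative example of the general idea, a top-k differentially-expressed-gene overlap can be computed as follows (the function name and construction are hypothetical sketches of a DEG-aware metric, not Systema's actual implementation):

```python
import numpy as np

def topk_deg_overlap(pred_delta, true_delta, k=50):
    """Illustrative DEG-aware metric: the fraction of the top-k truly
    differentially expressed genes (largest absolute change vs. control)
    that also appear among the model's top-k predicted changes.
    A random prediction scores near k / n_genes, so flat, near-control
    outputs cannot score well."""
    top_true = set(np.argsort(-np.abs(true_delta))[:k])
    top_pred = set(np.argsort(-np.abs(pred_delta))[:k])
    return len(top_true & top_pred) / k

# A flat ("no change") prediction recovers none of the true DEGs
true_delta = np.zeros(100)
true_delta[50:60] = 5.0   # ten strongly responding genes
flat_score = topk_deg_overlap(np.zeros(100), true_delta, k=10)
```

Metrics of this shape reward recovering which genes respond, rather than rewarding predictions that merely sit close to the global mean.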
The current state of perturbation prediction reveals several requirements for future progress. The diagram below outlines a proposed closed-loop framework for developing more robust virtual cells [5].
This framework emphasizes continuous learning and is built upon three essential data pillars: a priori knowledge, static architecture data, and dynamic states data [5].
Future efforts must prioritize generating high-quality, diverse perturbation data and developing biologically-grounded, well-calibrated benchmarks to guide the development of virtual cells that can truly accelerate target discovery [3] [1] [5].
Single-cell RNA sequencing (scRNA-seq) has revolutionized biological research by enabling high-resolution profiling of gene expression at the individual cell level, revealing cellular heterogeneity, developmental trajectories, and disease mechanisms that were previously obscured in bulk measurements [6] [7]. The exponential growth of single-cell transcriptomics data has catalyzed the development of single-cell foundation models (scFMs)—large-scale machine learning models pre-trained on millions of cells—with the promise of learning universal biological principles and accelerating discovery across diverse applications [6] [8].
These models, including prominent examples like scGPT, Geneformer, and scFoundation, adapt transformer architectures and other advanced neural network designs to analyze scRNA-seq data. They aim to capture complex gene-gene relationships and cellular states during pre-training, which can then be leveraged for downstream tasks with minimal additional task-specific training (fine-tuning) or even used directly (zero-shot) [6] [9]. Particularly compelling is their potential application in perturbation effect prediction—using computational models to forecast how cells will respond to genetic or chemical perturbations, which is crucial for understanding disease mechanisms and identifying therapeutic targets [2] [10].
However, as these models proliferate, rigorous benchmarking studies have raised critical questions about their actual performance relative to established, simpler methods, especially for predicting perturbation responses [2] [9] [10]. This guide provides an objective comparison of three leading scFMs—scGPT, Geneformer, and scFoundation—synthesizing evidence from recent comprehensive evaluations to help researchers navigate this rapidly evolving landscape.
Single-cell foundation models employ distinct architectural designs and pre-training strategies to learn from scRNA-seq data, which presents unique challenges of high dimensionality, sparsity, and technical noise [6] [7].
The following table compares the core architectural characteristics and pre-training configurations of scGPT, Geneformer, and scFoundation.
Table 1: Architectural and Pre-training Comparison of scFMs
| Feature | scGPT | Geneformer | scFoundation |
|---|---|---|---|
| Model Architecture | Transformer Encoder | Transformer Encoder | Asymmetric Encoder-Decoder |
| Parameters | ~50 million | ~40 million | ~100 million |
| Pre-training Dataset Size | ~33 million cells | ~30 million cells | ~50 million cells |
| Input Gene Count | 1,200 HVGs | 2,048 ranked genes | ~19,000 genes |
| Value Representation | Value binning | Ranking | Value projection |
| Gene Embedding | Lookup Table | Lookup Table | Lookup Table |
| Positional Embedding | No | Yes | No |
| Primary Pre-training Task | Masked Gene Modeling (MSE loss) | Masked Gene Modeling (CE loss) | Read-depth-aware MGM (MSE loss) |
A key differentiator among scFMs is how they handle input representation. scRNA-seq data consists of both gene identity and expression values, requiring specialized tokenization approaches [6] [7]: scGPT discretizes expression values into bins, Geneformer encodes each cell as a rank-ordered sequence of genes by expression, and scFoundation projects continuous values directly into the embedding space (see Table 1).
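As a simplified sketch of two of these tokenization schemes (the real implementations handle vocabulary mapping, special tokens, and sequence truncation, all omitted here):

```python
import numpy as np

def bin_values(expr, n_bins=51):
    """scGPT-style value binning (sketch): map each nonzero expression
    value to an equal-frequency bin index computed from the cell's own
    nonzero values; zero expression keeps its own token (bin 0)."""
    tokens = np.zeros(len(expr), dtype=int)
    nz = expr > 0
    if nz.any():
        edges = np.quantile(expr[nz], np.linspace(0, 1, n_bins))
        tokens[nz] = np.digitize(expr[nz], edges[1:-1]) + 1
    return tokens

def rank_genes(expr):
    """Geneformer-style ranking (sketch): order gene indices by
    expression, highest first, so the cell becomes a rank-ordered
    sequence of gene tokens with no explicit values."""
    return np.argsort(-expr, kind="stable")
```

Binning keeps a (coarsened) magnitude signal per gene, while ranking discards magnitudes entirely in favor of relative order, which is one reason the two model families behave differently on value-sensitive tasks.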
The diagram below illustrates the typical workflow for processing single-cell data in transformer-based scFMs, from input representation to output embedding.
Evaluating scFMs for perturbation effect prediction requires standardized benchmarks that assess their ability to predict transcriptomic changes after genetic perturbations. Key benchmarking frameworks have emerged, employing consistent datasets and metrics for fair comparison [2] [10].
Benchmarking studies typically employ a unified experimental protocol to evaluate model performance:
Training Configuration: Models are fine-tuned on datasets containing single genetic perturbations, then evaluated on their ability to predict effects of unseen single or double perturbations.
Data Sources: Common benchmark datasets include the Norman et al. dataset (single and double perturbations in K562 cells), the Replogle et al. datasets (CRISPRi in K562 and RPE1 cells), and the Adamson et al. dataset (K562 cells) [2].
Evaluation Metrics: Studies report the Pearson correlation between predicted and observed expression changes relative to control (the "Pearson delta" metric) and the L2 distance between predicted and observed expression for the most highly expressed genes [2].
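The two headline metrics can be sketched in a few lines of numpy (function names illustrative; the benchmark's exact preprocessing, such as which genes count as "highly expressed", may differ):

```python
import numpy as np

def pearson_delta(pred_expr, true_expr, control_expr):
    """Pearson correlation between predicted and observed expression
    *changes* relative to control (the 'Pearson delta' metric)."""
    pred_delta = pred_expr - control_expr
    true_delta = true_expr - control_expr
    return np.corrcoef(pred_delta, true_delta)[0, 1]

def l2_top_genes(pred_expr, true_expr, mean_expr, n_top=1000):
    """L2 distance restricted to the n_top most highly expressed genes
    (ranked here by a supplied mean-expression vector)."""
    top = np.argsort(-mean_expr)[:n_top]
    return float(np.linalg.norm(pred_expr[top] - true_expr[top]))
```

Note that Pearson delta is invariant to scaling the predicted changes, which is part of why rank- and DEG-aware alternatives have been proposed.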
The following diagram illustrates the standard workflow for benchmarking scFMs on perturbation prediction tasks.
Recent comprehensive benchmarks have yielded surprising results regarding the performance of scFMs compared to simpler methods for perturbation prediction and other tasks.
The table below summarizes quantitative results from multiple studies comparing scFMs against baseline methods for predicting perturbation effects.
Table 2: Perturbation Prediction Performance Across Datasets (Pearson Delta Metric)
| Model | Norman Dataset | Adamson Dataset | Replogle K562 | Replogle RPE1 | Genetic Interaction AUC |
|---|---|---|---|---|---|
| scGPT | 0.554 | 0.641 | 0.327 | 0.596 | 0.62 |
| Geneformer* | 0.521 | 0.588 | 0.305 | 0.562 | 0.59 |
| scFoundation | 0.459 | 0.552 | 0.269 | 0.471 | 0.55 |
| Additive Baseline | 0.670 | 0.712 | 0.425 | 0.665 | N/A |
| Train Mean Baseline | 0.557 | 0.711 | 0.373 | 0.628 | 0.64 |
| Random Forest + GO | 0.586 | 0.739 | 0.480 | 0.648 | 0.71 |
Note: Geneformer repurposed with linear decoder; results marked with * indicate models not specifically designed for this task. Data synthesized from [2] [10].
The results reveal a consistent pattern: deliberately simple baselines frequently match or exceed the performance of sophisticated foundation models. The additive model (summing individual logarithmic fold changes for double perturbations) and simple mean-based predictors demonstrate particularly strong performance, while tree-based models with biological prior knowledge (like Gene Ontology features) achieve the highest accuracy [2] [10].
Beyond perturbation prediction, studies evaluating zero-shot performance—using pre-trained models without any task-specific fine-tuning—reveal further limitations: zero-shot scFM embeddings do not provide consistent improvements over baseline models, particularly under distribution shift, and all benchmarked models struggle to predict strong or atypical perturbation effects [3].
The consistent underperformance of scFMs relative to simpler methods stems from several fundamental challenges in model design and training.
Ineffective Knowledge Transfer: The biological knowledge captured during large-scale pre-training does not appear to transfer effectively to the specific task of perturbation prediction. As one study notes, "pretraining on the single-cell atlas data provided only a small benefit over random embeddings" [2].
Architectural Misalignment: Transformer architectures, designed for natural language, may not be optimally suited for representing biological systems. The quadratic computational complexity of self-attention also limits scalability to full transcriptomes [7].
Simplistic Pre-training Objectives: Models trained primarily on masked gene prediction may learn to impute housekeeping genes but fail to capture deeper regulatory relationships necessary for predicting perturbation effects [11].
Data Quality and Variance Issues: Benchmark datasets often exhibit low perturbation-specific variance relative to technical noise, making it difficult to train and evaluate models effectively [10].
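The masked-gene pre-training objective criticized above can be illustrated with a minimal sketch (names and the toy "imputer" are hypothetical; real scFMs mask tokens, not raw values, and use learned decoders):

```python
import numpy as np

def masked_gene_mse(expr, predict_fn, mask_frac=0.15, seed=0):
    """Masked gene modeling objective (sketch): hide a random subset of
    expression values, ask the model to reconstruct them, and score MSE
    on the masked positions only (the MSE-loss variant of MGM)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(expr.shape) < mask_frac
    visible = np.where(mask, 0.0, expr)   # masked entries zeroed out
    pred = predict_fn(visible)
    if not mask.any():
        return 0.0
    return float(np.mean((pred[mask] - expr[mask]) ** 2))

# A trivial "model" that imputes every gene with the mean of the visible
# genes -- roughly what a model that only learns typical housekeeping
# expression levels would achieve.
mean_imputer = lambda v: np.full_like(v, v[v != 0].mean())
```

The point of the sketch: a model can drive this loss down by learning typical expression levels alone, without ever learning the regulatory structure that perturbation prediction requires.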
Promising alternatives are emerging to address these limitations, including interpretable probabilistic approaches such as the Gaussian-process-based GPerturb [12], models pretrained directly on perturbation data rather than on general single-cell atlases [2], and simpler learners enriched with biological prior knowledge, such as tree-based models using Gene Ontology features [10].
The table below catalogues key computational tools and datasets essential for conducting research in single-cell perturbation modeling.
Table 3: Essential Research Reagents for scFM and Perturbation Modeling
| Resource Name | Type | Primary Function | Relevance to scFM Research |
|---|---|---|---|
| Perturb-seq Datasets | Data | Provides ground truth perturbation responses | Gold-standard benchmarks for model evaluation [2] [10] |
| CELLxGENE | Data Platform | Curated single-cell data repository | Source of diverse pre-training and evaluation data [9] |
| BioLLM | Software Framework | Unified interface for diverse scFMs | Standardizes model access and evaluation [8] |
| PertEval-scFM | Benchmarking Framework | Standardized evaluation for perturbation prediction | Enables fair model comparison [3] [13] |
| Gene Ontology Annotations | Knowledge Base | Functional gene relationships | Provides biological prior knowledge for feature engineering [10] |
| GPerturb | Modeling Tool | Gaussian process-based perturbation modeling | Interpretable alternative to deep learning approaches [12] |
The current generation of single-cell foundation models represents a significant technical achievement in processing large-scale biological data, yet rigorous benchmarking reveals they have not yet fulfilled their promise for perturbation effect prediction. The consistent finding that simpler models frequently outperform sophisticated scFMs underscores the immaturity of this field and highlights the need for more biologically-grounded architectures and training approaches [2] [9] [10].
For researchers and drug development professionals, practical implications include: benchmarking any scFM against deliberately simple baselines (additive, mean-prediction, linear) before relying on its predictions; treating zero-shot perturbation predictions with particular caution, especially under distribution shift; and considering simpler, interpretable models enriched with biological priors, which currently match or exceed scFM accuracy at far lower computational cost [2] [9] [10].
Future development should focus on creating more biologically plausible architectures, improving pre-training objectives to capture causal relationships, and developing higher-quality benchmarking datasets with greater perturbation effect sizes. As the field matures, the integration of multi-omic data and explicit biological knowledge may help bridge the current performance gap, potentially realizing the transformative potential of foundation models for therapeutic discovery.
The emergence of single-cell foundation models (scFMs) has generated significant excitement in computational biology, promising a unified framework to decipher the complex language of cellular processes. Trained on millions of single-cell transcriptomes using transformer architectures inspired by natural language processing, these models theoretically learn fundamental biological principles that can be adapted to various downstream tasks [14]. Among the most anticipated applications is perturbation effect prediction—the ability to forecast how genetic interventions will alter cellular states, a capability with profound implications for drug discovery and functional genomics. However, as investment in these complex models grows, rigorous independent benchmarking has revealed a sobering reality: the promise often exceeds current performance, making critical evaluation non-negotiable for guiding future research and clinical applications [3] [2] [6].
Recent comprehensive benchmark studies have systematically evaluated whether scFMs actually enhance our ability to predict perturbation effects compared to simpler approaches. The consistent finding across multiple independent investigations is that zero-shot scFM embeddings do not provide consistent improvements over deliberately simple baseline models, particularly when predicting strong or atypical perturbation effects or under distribution shift [3] [2]. This revelation underscores the critical importance of standardized evaluation frameworks in an era of increasingly complex AI models for biological discovery.
Independent research teams have developed standardized frameworks to evaluate scFMs for perturbation prediction. PertEval-scFM provides a standardized evaluation framework specifically designed for assessing perturbation effect prediction capabilities [3]. Similarly, a comprehensive benchmark published in Nature Methods compared five foundation models and two other deep learning models against deliberately simple baselines for predicting transcriptome changes after single or double perturbations [2]. These studies employed multiple quantitative metrics to ensure robust assessment, including L2 distance between predicted and observed expression values for highly expressed genes, Pearson delta correlation measures, and specialized metrics for genetic interaction prediction [2]. Additional benchmarking efforts have introduced biology-informed evaluation perspectives, such as the scGraph-OntoRWR metric, which measures the consistency of cell type relationships captured by scFMs with prior biological knowledge [6].
A critical test for perturbation prediction models involves forecasting expression changes after dual-gene perturbations. Using the Norman et al. dataset where 100 individual genes and 124 pairs of genes were upregulated in K562 cells, researchers fine-tuned models on all single perturbations and 62 double perturbations, then assessed prediction error on the remaining 62 double perturbations [2]. The results revealed that all deep learning models had substantially higher prediction error (L2 distance for the 1,000 most highly expressed genes) compared to a simple additive baseline that sums individual logarithmic fold changes without using double perturbation data [2]. Furthermore, when predicting genetic interactions—defined as double perturbation phenotypes that differ surprisingly from additive expectations—no model outperformed the "no change" baseline that always predicts control condition expression [2].
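The interaction logic—comparing the observed double-perturbation response to the additive expectation—can be illustrated with a toy classifier (the threshold, function name, and category labels are illustrative simplifications, not the benchmark's definitions):

```python
import numpy as np

def classify_interaction(control, single_a, single_b, double_obs, tol=0.5):
    """Label a double perturbation by comparing the observed expression
    change to the additive expectation (sum of the two single log fold
    changes). Thresholding the mean deviation is a toy simplification."""
    additive = (single_a - control) + (single_b - control)
    observed = double_obs - control
    deviation = np.mean(observed - additive)
    if deviation > tol:
        return "synergistic"   # stronger than additive expectation
    if deviation < -tol:
        return "buffering"     # weaker than additive expectation
    return "additive"

control  = np.zeros(3)
single_a = np.array([1.0, 0.0, 0.0])
single_b = np.array([0.0, 1.0, 0.0])
label = classify_interaction(control, single_a, single_b,
                             np.array([3.0, 3.0, 0.0]))
```

By construction the additive baseline can never call an interaction, which is why the benchmark's finding that no deep learning model beat the "no change" baseline on this task is so damning.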
Table 1: Performance Comparison in Double Perturbation Prediction
| Model Category | Example Models | Prediction Error (L2 Distance) | Genetic Interaction Prediction |
|---|---|---|---|
| Simple Baselines | Additive Model, No Change Model | Lower | Not competitive (Additive) / Baseline performance (No Change) |
| Specialized DL Models | GEARS, CPA | Higher | Not better than baseline |
| Single-cell Foundation Models | scGPT, scFoundation, Geneformer | Higher | Not better than baseline |
A claimed advantage of foundation models is their potential to predict effects of completely unseen perturbations using knowledge learned during pretraining. To benchmark this capability, researchers used CRISPR interference datasets from Replogle et al. (K562 and RPE1 cells) and Adamson et al. (K562 cells) [2]. They compared specialized models against simple linear models and an even simpler mean prediction baseline. The findings demonstrated that none of the deep learning models consistently outperformed the mean prediction or linear model [2]. When researchers extracted gene embedding matrices from scFoundation and scGPT and used them in the simple linear model framework, these embeddings performed as well as or better than the original models with their built-in decoders, but did not consistently outperform linear models using embeddings derived directly from training data [2].
Table 2: Unseen Perturbation Prediction Performance
| Model Approach | Pearson Correlation with Observed Expression | Consistency Across Cell Lines |
|---|---|---|
| Mean Prediction Baseline | Competitive | High |
| Simple Linear Model | Competitive | High |
| Foundation Models (scGPT, scFoundation) | Not consistently better than baselines | Variable |
| Linear Model with scFM Embeddings | Comparable to full scFMs | Variable |
The benchmarking process follows a standardized workflow to ensure fair comparison across models. The initial phase involves data preparation and partitioning, using publicly available perturbation datasets such as Norman et al. (for double perturbations) or Replogle et al. and Adamson et al. (for unseen perturbation prediction) [2]. For double perturbation experiments, standard practice involves using all single perturbations and a randomly selected half of double perturbations for training, with the remaining double perturbations held out for testing [2]. The next stage involves model fine-tuning and inference, where each model is fine-tuned on the training data according to its recommended settings, then used to generate predictions for the test conditions [2]. Finally, performance quantification calculates metrics like L2 distance or Pearson correlation between predicted and observed expression values, with multiple runs using different random partitions to ensure robustness [2].
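The data-partitioning step described above can be sketched as follows (a minimal illustration; the actual benchmark code and identifiers differ):

```python
import numpy as np

def split_double_perturbations(double_pert_ids, seed=0):
    """Randomly split double perturbations in half: one half joins all
    single perturbations as training data, the other half is held out
    for testing. Re-run with different seeds for robustness."""
    rng = np.random.default_rng(seed)
    ids = np.array(double_pert_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return list(ids[:half]), list(ids[half:])

# 124 perturbation pairs, as in the Norman et al. dataset
train_doubles, test_doubles = split_double_perturbations(
    [f"pair_{i}" for i in range(124)], seed=1)
```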
Figure 1: Standardized Benchmarking Workflow for scFM Evaluation
The benchmarking studies deliberately include straightforward baselines to contextualize scFM performance. The "no change" model always predicts the same expression as in the control condition, providing a minimal performance threshold [2]. The "additive" model calculates the sum of individual logarithmic fold changes for each gene in a double perturbation without using any double perturbation data for training [2]. For unseen perturbation prediction, a simple linear model represents each read-out gene with a K-dimensional vector and each perturbation with an L-dimensional vector, finding the optimal mapping through least-squares regression [2]. An even simpler mean prediction baseline predicts the average expression across training perturbations for all test conditions [2]. These baselines establish performance expectations that any specialized model should reasonably exceed.
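The linear baseline's least-squares fit can be sketched as follows (a simplified, one-sided version of the gene-vector-by-perturbation-vector formulation described above; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_baseline(pert_embeddings, delta_expr):
    """Least-squares fit of a weight matrix W mapping perturbation
    embeddings (n_perts x L) to expression changes (n_perts x n_genes).
    Unseen perturbations are then predicted as embedding @ W."""
    W, *_ = np.linalg.lstsq(pert_embeddings, delta_expr, rcond=None)
    return W

# Synthetic check: data generated from a true linear map is recovered
P_train = rng.normal(size=(20, 3))   # 20 training perturbations, L = 3
W_true  = rng.normal(size=(3, 5))    # 5 read-out genes
deltas  = P_train @ W_true
W_hat   = fit_linear_baseline(P_train, deltas)
P_new   = rng.normal(size=(4, 3))    # unseen perturbations
pred    = P_new @ W_hat
```

Swapping in gene embeddings extracted from scGPT or scFoundation as the rows of `pert_embeddings` reproduces the study's finding that scFM embeddings inside this linear framework perform comparably to the full models.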
Table 3: Essential Research Reagents and Computational Tools for scFM Benchmarking
| Resource Name | Type | Function in Research |
|---|---|---|
| Norman et al. Dataset | Experimental Data | Provides single and double perturbation data for K562 cells for benchmark validation [2] |
| Replogle et al. Dataset | Experimental Data | CRISPRi perturbation data in K562 and RPE1 cells for unseen perturbation tests [2] |
| Adamson et al. Dataset | Experimental Data | Additional perturbation data in K562 cells for benchmark diversity [2] |
| PertEval-scFM | Software Framework | Standardized evaluation framework for perturbation prediction models [3] |
| GPerturb | Computational Method | Gaussian process-based model providing competitive alternative to scFMs [12] |
The consistent underperformance of complex scFMs relative to simple baselines demands explanation. Analysis reveals that for many genes, predictions from scGPT, UCE, and scBERT showed minimal variation across different perturbations, resembling the "no change" baseline [2]. Meanwhile, GEARS and scFoundation predictions varied considerably less than the ground truth observations [2]. This suggests that current scFMs may be struggling with representation learning—the core promise of foundation models—failing to capture the nuanced relationships between genes necessary for accurate perturbation prediction [2].
The surprising competitive performance of simple linear models and mean predictions indicates that current scFMs may not be effectively leveraging their pretraining on large single-cell atlases for this specific task [2]. In contrast, pretraining directly on perturbation data—rather than general single-cell atlas data—provided more substantial benefits for prediction accuracy [2]. This suggests that the biological principles necessary for perturbation prediction may not be efficiently transferred from general scFM pretraining, or that the models are prioritizing other aspects of cellular representation during pretraining.
Figure 2: Performance Paradox: Why Simpler Models Can Compete with Complex scFMs
The comprehensive benchmarking of single-cell foundation models for perturbation prediction reveals a critical juncture in the field. While scFMs represent a theoretically promising approach for modeling cellular behavior, current-generation models do not consistently outperform simpler, more interpretable baselines for predicting perturbation effects [3] [2]. This performance gap highlights the immaturity of scFM technology for this specific application and underscores the non-negotiable importance of rigorous benchmarking in directing methodological development.
Future progress will likely require specialized models trained specifically on perturbation data rather than general single-cell atlases, alongside continued development of high-quality datasets capturing a broader range of cellular states [3] [2]. The benchmarking efforts themselves must also evolve, incorporating more biologically meaningful metrics like the scGraph-OntoRWR that assesses consistency with prior biological knowledge [6]. As the field advances, the relationship between model complexity and practical utility must be continually reevaluated, with benchmarking serving as the essential compass guiding development toward models that genuinely enhance our ability to predict and understand cellular responses to perturbation.
The ambitious goal of predicting a cell's transcriptional response to genetic perturbation using single-cell Foundation Models (scFMs) represents a potential frontier in computational biology, with profound implications for rare disease modeling and therapeutic development. The core thesis of current evaluation research, however, reveals a surprising consensus: despite their complexity and computational cost, modern scFMs have not yet consistently surpassed deliberately simple linear baselines in predicting perturbation effects. This guide provides an objective, data-driven comparison of the performance of prominent scFMs against a suite of simpler alternative models, synthesizing evidence from recent rigorous benchmarks to inform researchers and drug development professionals.
Table 1: Summary of Model Performance on Key Prediction Tasks
| Model | Model Class | Double Perturbation Prediction (L2 Error, Norman et al. data) | Unseen Single Gene Perturbation (Avg. Pearson r) | Key Strengths | Key Limitations |
|---|---|---|---|---|---|
| Additive Baseline | Simple Baseline | Lowest Error [2] | N/A | Simple, interpretable, fast | Cannot predict genetic interactions |
| No Change Baseline | Simple Baseline | Higher than Additive [2] | 0.977 (Replogle K562) [2] | Very simple, stable | Biased towards control state |
| Linear Model | Simple Baseline | N/A | 0.979 (Replogle K562) [2] | Simple, can extrapolate | Limited non-linear capacity |
| scGPT | Foundation Model | Higher than Baselines [2] | 0.974 (Replogle K562) [2] | Flexible architecture | High compute, underperforms baselines |
| Geneformer | Foundation Model | Higher than Baselines [2] | N/A | Context-aware embeddings | Not designed for perturbation prediction |
| GEARS | Deep Learning | Higher than Baselines [2] | 0.969 (Replogle K562) [2] | Incorporates gene graphs | Complex, poor on unseen perturbations |
| GPerturb | Gaussian Process | N/A | 0.981 (Gaussian, Replogle) [12] | Uncertainty estimates, interpretable | Less scalable to huge cell counts |
| CPA | Deep Learning | Not Competitive [2] | 0.984 (mlp, Replogle) [12] | Handles dose-response | Not for unseen perturbations [2] |
Table 2: Performance on Predicting Genetic Interactions (e.g., Synergy, Buffering)
| Model | Buffering Interactions (Recall) | Synergistic Interactions (Recall) | Opposite Interactions (Recall) | Overall Accuracy |
|---|---|---|---|---|
| Additive Baseline | 0% (by design) | 0% (by design) | 0% (by design) | N/A |
| No Change Baseline | Low [2] | 0% (by design) [2] | Low [2] | Not better than random [2] |
| scGPT | Low [2] | Very Rare [2] | Very Rare [2] | Not better than random [2] |
| GEARS | Low [2] | Very Rare [2] | Very Rare [2] | Not better than random [2] |
| scFoundation | Low [2] | Very Rare [2] | Very Rare [2] | Not better than random [2] |
The pivotal findings presented here stem from standardized benchmarking frameworks like PertEval-scFM, designed to ensure a fair and rigorous comparison between complex scFMs and simple baselines [3]. The core experimental protocol for the double perturbation prediction benchmark, as applied to the Norman et al. dataset, is as follows: models are fine-tuned on all single perturbations plus a randomly selected half of the double perturbations, and prediction error (L2 distance over the 1,000 most highly expressed genes) is measured on the held-out double perturbations [2].
The benchmark for predicting the effects of completely unseen single-gene perturbations employs datasets from Replogle et al. (K562 and RPE1 cells) and Adamson et al. (K562 cells) [2]. The protocol is: models are trained on a subset of single-gene perturbations and asked to predict expression changes for perturbations absent from training; for the linear baseline, a weight matrix (W) is learned to minimize the difference between predicted and observed expression in the training data.
Figure 1: Standardized Benchmarking Workflow. The process involves splitting perturbation data, training diverse model classes, and evaluating performance to generate a final ranking.
Figure 2: Model Architecture Paradigms. Contrasts the complex fine-tuning of scFMs with the simpler linear mapping and the interpretable, sparse structure of GPerturb.
Table 3: Essential Research Reagents and Computational Tools for Perturbation Modeling
| Reagent / Tool | Function / Description | Example Use in Benchmarking |
|---|---|---|
| CRISPR Activation (CRISPRa) | Targeted genetic perturbation to upregulate gene expression. | Generating single and double perturbation data in K562 cells for model training and testing [2]. |
| Perturb-seq / CROP-seq | Single-cell RNA sequencing combined with CRISPR screening. | Provides high-throughput, single-cell resolution readouts of perturbation effects [12]. |
| K562 Cell Line | A chronic myelogenous leukemia cell line. | A standard, widely used model system for perturbation screens (e.g., in Norman, Replogle datasets) [2]. |
| RPE1 Cell Line | A retinal pigment epithelial cell line. | Used to test model generalizability across different cellular contexts [2]. |
| Linear Regression Model | A simple statistical model for predicting a continuous outcome. | Serves as a powerful and hard-to-beat baseline for predicting perturbation effects [2]. |
| Gaussian Process (GP) Regression | A non-parametric Bayesian modeling technique. | Used in GPerturb to provide uncertainty estimates alongside predictions [12]. |
| Gene Ontology (GO) Annotations | A structured knowledge base of gene functions. | Used by some models (e.g., GEARS) to inform relationships between genes for predicting unseen perturbations [2]. |
The prediction of perturbation effects, a cornerstone of functional genomics and therapeutic development, demands computational models capable of interpreting the complex, interlinked nature of biological systems. In this domain, model architectures are not merely technical choices but fundamental determinants of what biological phenomena can be captured. The evaluation of these models, particularly of single-cell foundation models (scFMs), reveals critical trade-offs between their ability to generalize to unseen perturbations and their susceptibility to confounding by systematic variation.
This guide provides a structured comparison of predominant architectural families—from various autoencoder formulations to sophisticated graph networks—deconstructing their performance, experimental protocols, and implementation requirements within perturbation effect prediction research. As benchmark studies reveal, even sophisticated models often perform comparably to simple baselines that capture average treatment effects when evaluated using conventional metrics, highlighting the critical need for rigorous, bias-aware evaluation frameworks like Systema [15]. Understanding these architectural nuances is essential for researchers and drug development professionals selecting appropriate models for specific perturbation prediction tasks.
Table 1: Performance comparison of key architecture families across benchmark tasks.
| Architecture Family | Representative Model(s) | Primary Application Domain | Key Performance Metrics | Reported Performance | Notable Strengths | Key Limitations |
|---|---|---|---|---|---|---|
| Graph Convolutional Networks | PLGNN [16], GCN [17] | Node classification, Graph classification | Accuracy | Avg. 2.6% improvement on node classification; Avg. 2.1% on graph classification vs. SOTA [16] | Adaptive feature aggregation, Robustness to missing information | Limited higher-level semantic extraction in shallow implementations |
| Autoencoder-Graph Hybrids | scCAGN [18], DDGAE [19] | scRNA-seq clustering, Drug-target interaction prediction | Normalized Mutual Information (NMI), AUC, AUPR | NMI: 0.9732 (QS_diaphragm) [18]; AUC: 0.9600, AUPR: 0.6621 [19] | Dynamic feature fusion, Superior representation learning | Computational complexity, Integration challenges |
| Stacked Autoencoders with Optimization | optSAE+HSAPSO [20] | Drug classification, Target identification | Accuracy, Computational efficiency | Accuracy: 95.52%; Time: 0.010s/sample [20] | High predictive accuracy, Rapid processing | Dependent on training data quality, Hyperparameter sensitivity |
| Simple Baselines | Perturbed Mean, Matching Mean [15] | Perturbation response prediction | PearsonΔ, PearsonΔ20 | Comparable or superior to SOTA methods across 10 datasets [15] | Computational simplicity, Resistance to overfitting | Limited capture of perturbation-specific effects |
| Dynamic Weighting Graph Networks | DWR-GCN (within DDGAE) [19] | Drug-target interaction prediction | AUC, AUPR | Enhances representation capability without over-smoothing [19] | Increased network depth, Mitigated over-smoothing | Implementation complexity |
Table 2: Performance on perturbation response prediction tasks (adapted from Systema benchmarking [15]).
| Model Architecture | Adamson Dataset (PearsonΔ) | Norman Dataset (PearsonΔ) | Replogle RPE1 Dataset (PearsonΔ) | Generalization to Unseen Perturbations |
|---|---|---|---|---|
| Perturbed Mean Baseline | High | High | High | Limited to average effects |
| Matching Mean Baseline | Not applicable | Highest | Not applicable | Good for combinatorial perturbations |
| CPA | Moderate | Moderate | Moderate | Limited by design |
| GEARS | Moderate | Moderate-High | Moderate | Moderate for one-gene perturbations |
| scGPT | Moderate | Moderate | Moderate | Varies by dataset |
The Systema framework establishes rigorous protocols for evaluating perturbation response prediction methods, emphasizing the need to control for systematic variation—consistent differences between perturbed and control cells arising from selection biases or biological confounders [15]. Standard metrics like Pearson correlation of expression changes (PearsonΔ) and Pearson correlation of top 20 differentially expressed genes (PearsonΔ20) are susceptible to these biases, potentially overestimating model performance.
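As a concrete illustration, the two correlation metrics can be sketched in a few lines of NumPy. The top-k selection here (by largest observed absolute change) is a simplified stand-in for the benchmark's differential-expression-based gene selection:

```python
import numpy as np

def pearson_delta(pred, obs, ctrl, top_k=None):
    """Pearson correlation between predicted and observed expression
    *changes* relative to control (PearsonDelta). With top_k set, the
    correlation is restricted to the genes with the largest observed
    absolute change -- a simplified proxy for PearsonDelta20."""
    d_pred = np.asarray(pred) - np.asarray(ctrl)
    d_obs = np.asarray(obs) - np.asarray(ctrl)
    if top_k is not None:
        idx = np.argsort(-np.abs(d_obs))[:top_k]
        d_pred, d_obs = d_pred[idx], d_obs[idx]
    return float(np.corrcoef(d_pred, d_obs)[0, 1])
```

A perfect prediction yields a PearsonΔ of 1, while a model that merely reproduces the control profile produces a constant delta vector whose correlation is undefined, one reason mean-based comparators are so informative.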
Proper experimental evaluation should therefore control for systematic variation and include comparisons against simple mean-based baselines.
The PLGNN framework addresses missing information in graph data through feature-perturbation strategies; training minimizes a combined loss function incorporating both supervised classification objectives and regularization terms from the feature perturbation process [16].
The scCAGN methodology for single-cell RNA sequencing clustering integrates autoencoder and graph network components through a joint training mechanism [18].
The optSAE+HSAPSO drug classification framework operates in two phases, with joint optimization enabling the model to achieve high accuracy while reducing computational overhead compared to traditional deep learning approaches [20].
Graph Autoencoder Architecture: This diagram illustrates the encoder-decoder structure common to many perturbation prediction models, where input data is compressed into a latent representation before reconstruction.
scCAGN Integrated Workflow: This workflow shows the parallel processing of single-cell data through both autoencoder and graph network pathways before dynamic fusion and clustering.
Systematic Variation Sources: This diagram decomposes observed transcriptional responses into perturbation-specific effects and systematic confounders that can bias model evaluation.
Table 3: Key research reagents and computational tools for perturbation modeling experiments.
| Resource Category | Specific Resource | Application Context | Key Functionality |
|---|---|---|---|
| Benchmark Datasets | Adamson (2016) [15], Norman (2019) [15], Replogle (2022) [15] | Perturbation response prediction | Provide standardized benchmarking across technologies and cell lines |
| Evaluation Frameworks | Systema [15] | Model evaluation | Quantifies systematic variation and enables bias-aware performance assessment |
| Biological Networks | DrugBank [19], HPRD [19], STRING | Drug-target interaction, Protein-protein interaction | Source of prior biological knowledge for network-based models |
| Molecular Descriptors | Molecular fingerprints [21], Chemical Checker signatures [21] | Drug representation in synergy prediction | Encodes chemical structure information for machine learning |
| Graph Learning Libraries | PyTorch Geometric [22] | GNN implementation | Provides efficient graph neural network operations and pre-processing |
| Single-Cell Analysis Tools | Seurat [18], Scanpy | scRNA-seq preprocessing | Quality control, normalization, and feature selection for single-cell data |
A transformative shift is underway in computational biology, where the prediction of cellular responses to genetic perturbations is foundational for understanding disease mechanisms and identifying therapeutic targets. The advent of single-cell foundation models (scFMs), pretrained on millions of single-cell transcriptomes, promised to learn general principles of cellular biology that could accurately predict transcriptional outcomes of unseen genetic perturbations. However, recent rigorous benchmarking studies reveal a surprising counter-narrative: deliberately simple baseline models, including linear models and the "perturbed mean" approach, consistently match or surpass the performance of these complex deep-learning architectures [2] [23] [15]. This comparison guide synthesizes evidence from multiple systematic evaluations to objectively assess the performance landscape of perturbation prediction methods, providing researchers with evidence-based guidance for method selection and highlighting critical considerations for robust model evaluation.
Recent comprehensive benchmarks across multiple datasets and cell lines demonstrate that simple baselines achieve competitive performance compared to state-of-the-art foundation models.
Table 1: Performance Comparison of Perturbation Prediction Methods (PearsonΔ Metric)
| Method | Adamson Dataset | Norman Dataset | Replogle K562 | Replogle RPE1 |
|---|---|---|---|---|
| Perturbed Mean | 0.711 | 0.557 | 0.373 | 0.628 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
| Random Forest + GO Features | 0.739 | 0.586 | 0.480 | 0.648 |
Source: Adapted from benchmarking results [23]
The perturbed mean baseline, which simply predicts the average expression across all perturbed cells in the training data, consistently outperforms both scGPT and scFoundation across all datasets [23] [15]. Similarly, for predicting combinatorial perturbation effects in the Norman dataset, the matching mean baseline (averaging the centroids of the individual perturbations) outperformed specialized deep-learning models, beating the best alternative method (GEARS) by an 11% margin [15].
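Both baselines are trivial to implement, which is precisely the point. A minimal NumPy sketch (function names are ours, and the published definitions may center or weight the profiles differently):

```python
import numpy as np

def perturbed_mean_baseline(train_means):
    """'Perturbed mean': predict, for any held-out perturbation, the
    average expression profile across all training perturbations.
    train_means: (n_perturbations, n_genes) array of per-perturbation
    mean expression."""
    return np.asarray(train_means).mean(axis=0)

def matching_mean_baseline(single_means, pair):
    """'Matching mean' for a double perturbation (a, b): average the
    centroids of the two constituent single perturbations."""
    a, b = pair
    return 0.5 * (np.asarray(single_means[a]) + np.asarray(single_means[b]))
```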
The benchmark extended to evaluating models' ability to predict genetic interactions—instances where simultaneous perturbations produce unexpected effects compared to individual perturbations. Using data where 100 individual genes and 124 pairs of genes were upregulated in K562 cells, researchers assessed how well models could predict these non-additive effects [2]. Surprisingly, none of the foundation models (scGPT, scFoundation, GEARS, CPA) outperformed the simplistic "no change" baseline that always predicts expression identical to control conditions [2]. All models predominantly predicted buffering interactions and rarely correctly identified synergistic interactions, revealing a significant limitation in current approaches for capturing complex genetic interplay.
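One way to make the interaction categories concrete is to compare the observed double-perturbation change against the additive expectation from the two singles. The thresholds and the cosine test below are illustrative choices, not the benchmark's exact definitions:

```python
import numpy as np

def classify_interaction(d_a, d_b, d_ab, tol=0.1):
    """Classify a double perturbation against the additive expectation
    d_a + d_b, where each d_* is a per-gene expression change vs. control."""
    expected = d_a + d_b
    cos = float(d_ab @ expected) / (np.linalg.norm(d_ab) * np.linalg.norm(expected))
    if cos < 0:
        return "opposite"       # response flips direction
    ratio = np.linalg.norm(d_ab) / np.linalg.norm(expected)
    if ratio > 1 + tol:
        return "synergistic"    # stronger than additive
    if ratio < 1 - tol:
        return "buffering"      # weaker than additive
    return "additive"
```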
The consistent underperformance of complex models across studies raises critical questions about evaluation methodologies. Key benchmarking frameworks include:
PertEval-scFM: A standardized framework for evaluating zero-shot single-cell foundation model embeddings against baseline models, specifically designed to assess whether contextualized representations enhance perturbation effect prediction [13] [3].
Systema: An evaluation framework that emphasizes perturbation-specific effects and identifies predictions that correctly reconstruct the perturbation landscape, specifically addressing systematic variation biases [15].
These frameworks employ rigorous cross-validation strategies, typically fine-tuning models on a subset of perturbations and assessing prediction error on held-out perturbations across multiple random partitions to ensure robustness [2].
Benchmarks employ multiple metrics to comprehensively assess model performance, including PearsonΔ, PearsonΔ20, and L2 distances between predicted and observed expression profiles.
Recent research reveals that standard metrics are susceptible to systematic variation—consistent transcriptional differences between perturbed and control cells arising from selection biases or confounders [15]. This systematic variation can lead to overestimated performance for methods that primarily capture average perturbation effects rather than perturbation-specific biology.
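A simple way to see why systematic variation inflates these metrics is to split per-perturbation expression changes into a component shared across all perturbations and a perturbation-specific residual. A mean-based decomposition sketch (Systema's actual estimator may differ):

```python
import numpy as np

def decompose_effects(deltas):
    """Split a (perturbations x genes) matrix of mean expression
    changes (vs. control) into a shared component and
    perturbation-specific residuals."""
    deltas = np.asarray(deltas, dtype=float)
    systematic = deltas.mean(axis=0)   # shared 'systematic' shift
    specific = deltas - systematic     # what a useful model must explain
    return systematic, specific
```

A model that predicts only the shared component can score well on PearsonΔ while explaining none of the perturbation-specific residual.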
Diagram 1: Systematic Variation in Perturbation Datasets Affects Benchmarking
Table 2: Key Research Reagents and Experimental Resources
| Resource | Type | Function in Perturbation Studies |
|---|---|---|
| CRISPR Activation (CRISPRa) | Perturbation Technology | Gene overexpression in perturbation screens [2] |
| CRISPR Interference (CRISPRi) | Perturbation Technology | Gene knockdown in perturbation screens [23] |
| Perturb-seq | Screening Technology | Combines CRISPR perturbations with single-cell sequencing [23] |
| Gene Ontology (GO) Annotations | Biological Database | Provides functional gene annotations for feature engineering [2] [23] |
| Adamson Dataset | Experimental Data | CRISPRi perturbation data with 68,603 single cells [23] |
| Norman Dataset | Experimental Data | Single/double CRISPRa perturbations with 91,205 single cells [2] [23] |
| Replogle Dataset | Experimental Data | Genome-wide CRISPRi screen with ~162,750 cells per cell line [23] |
The surprisingly strong performance of simple baselines can be largely explained by systematic variation in perturbation datasets—consistent differences between perturbed and control cells that arise from selection biases, confounding variables, or underlying biological factors, for example shifts in cell cycle distribution between perturbed and control populations [15] [33].
Diagram 2: Why Simple Baselines Succeed in Current Benchmarks
Beyond systematic variation issues, several inherent limitations contribute to the underperformance of foundation models, including distribution shift between pretraining data and perturbation screens and a persistent inability to capture synergistic genetic interactions [3] [2].
The consistent evidence across multiple rigorous benchmarks indicates that current deep-learning-based foundation models for perturbation effect prediction do not yet provide substantial advantages over deliberately simple linear baselines and mean-based approaches. This conclusion holds across diverse experimental datasets, perturbation types, and evaluation metrics [2] [23] [15].
For researchers and drug development professionals, these findings suggest benchmarking any proposed model against simple mean-based and linear baselines, and treating reported performance gains with caution until they are confirmed under bias-aware evaluation frameworks.
The field stands to benefit from increased focus on performance metrics and benchmarking standards that will facilitate genuine progress toward the goal of generalizable predictive models in computational biology [2]. As benchmarking methodologies become more sophisticated and datasets more comprehensive, the true potential of both simple and complex approaches can be properly assessed and harnessed for biological discovery and therapeutic development.
Predicting how individual cells respond to genetic or chemical perturbations represents a fundamental challenge in computational biology with significant implications for understanding disease mechanisms and therapeutic development [24]. The emergence of single-cell RNA sequencing (scRNA-seq) and CRISPR screening technologies has generated unprecedented volumes of high-resolution data, creating both opportunities and challenges for computational method development [24]. In this landscape, two competing approaches have emerged: complex deep learning models, including single-cell foundation models (scFMs), and simpler, often classically-inspired statistical methods.
Recent benchmarking studies have revealed a surprising trend: sophisticated models often fail to outperform simple baselines. The PertEval-scFM benchmark demonstrated that zero-shot scFM embeddings provide no consistent improvement over simpler baseline models, particularly under distribution shift [3] [13] [25]. Similarly, an independent 2025 benchmarking study found that even the simplest baseline model—taking the mean of training examples—outperformed foundation models scGPT and scFoundation [23]. These findings highlight a critical need for interpretable, robust methods that can genuinely capture biological mechanisms rather than merely learning systematic biases in training data.
Within this context, GPerturb emerges as a novel Gaussian process-based approach that balances predictive performance with interpretability and uncertainty quantification [24]. This case study examines GPerturb's methodological framework, benchmarking results, and practical utility for researchers and drug development professionals.
GPerturb is a hierarchical Bayesian model designed specifically for estimating sparse, interpretable gene-level perturbation effects from single-cell CRISPR screening data [24]. Unlike "black box" deep learning approaches, GPerturb employs a transparent generative modeling structure that separates biological signal from technical noise through two distinct components: a basal (unperturbed) expression component and an additive, sparse perturbation-effect component [24].
The model employs Gaussian processes (GPs) to capture nonlinear relationships mapping cell-specific parameters and perturbation types to observed expression levels [24]. This nonparametric Bayesian approach provides natural uncertainty estimates for both the presence and strength of perturbation effects on individual genes, a critical feature for reliable biological discovery.
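To make the uncertainty-quantification idea concrete, here is a minimal from-scratch GP regression posterior in NumPy. It is a generic illustration of the machinery GPerturb builds on, not GPerturb's actual model:

```python
import numpy as np

def rbf_kernel(X1, X2, length=1.0, var=1.0):
    # Squared-exponential kernel between two sets of points.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X_train, y_train, X_test, noise=1e-4):
    """Exact GP regression posterior: predictive mean and per-point
    standard deviation -- the kind of uncertainty estimate a GP-based
    model can attach to each predicted perturbation effect."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_test, X_train)
    Kss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    v = np.linalg.solve(K, Ks.T)
    var = np.diag(Kss - Ks @ v)
    return mean, np.sqrt(np.clip(var, 0.0, None))
```

Near observed data the posterior standard deviation shrinks toward the noise level; far from it, the prior variance dominates, flagging predictions that should not be trusted.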
Figure 1: GPerturb's architectural framework separates basal expression from perturbation effects using Gaussian process priors and sparsity constraints.
A key practical advantage of GPerturb is its flexibility in handling different data types, which are common points of friction in single-cell analysis: a Gaussian formulation (GPerturb-Gaussian) for continuous, normalized expression values and a zero-inflated formulation (GPerturb-ZIP) for raw count data [24].
This dual formulation allows researchers to apply GPerturb regardless of their preprocessing pipeline, eliminating the need for potentially distorting data transformations required by other methods.
GPerturb's performance has been rigorously evaluated against leading perturbation prediction methods across multiple datasets. The following table summarizes its performance in predicting single-gene perturbation effects compared to state-of-the-art approaches:
Table 1: Performance comparison on single-gene perturbation prediction from a genome-wide CRISPRi Perturb-seq dataset
| Method | Input Data Type | Pearson Correlation | Key Limitations |
|---|---|---|---|
| GPerturb-Gaussian | Continuous | 0.981 | Slightly lower than CPA-mlp |
| CPA-mlp | Continuous | 0.984 | Requires categorical cell information |
| GEARS | Continuous | 0.977 | Limited to discrete perturbations |
| GPerturb-ZIP | Count-based | 0.972 | - |
| SAMS-VAE | Count-based | 0.944 | Cannot incorporate cell-level information |
Data adapted from GPerturb benchmark studies [24]
In these head-to-head comparisons, GPerturb demonstrated competitive performance across different data modalities. GPerturb-Gaussian nearly matched the performance of CPA-mlp (0.981 vs. 0.984) while offering superior interpretability, while GPerturb-ZIP substantially outperformed SAMS-VAE on count-based data (0.972 vs. 0.944) [24].
Beyond overall correlation, the directionality of predicted perturbation effects (whether a perturbation increases or decreases gene expression) represents a critical metric for biological utility. GPerturb demonstrates notable advantages in this domain:
Table 2: Directionality agreement between methods for perturbation effect predictions
| Comparison Pair | Directionality Agreement | Key Discrepancies |
|---|---|---|
| GPerturb-Gaussian vs. CPA | Moderate | Exosome-related perturbation effects |
| GPerturb-Gaussian vs. GEARS | Moderate | Exosome-related perturbation effects |
| GPerturb-ZIP vs. SAMS-VAE | High | Minimal |
| All methods consensus | Low | Only 21 genes shared across methods |
Data synthesized from benchmark analyses [24]
These discrepancies highlight a concerning lack of consensus in the field, with different methods frequently predicting opposite effects for the same gene-perturbation pairs [24]. GPerturb's higher consistency with SAMS-VAE on count-based data suggests its perturbation effect estimates may be more biologically reliable.
Recent research has revealed that systematic variation—consistent differences between perturbed and control cells arising from selection biases or confounders—can lead to overoptimistic performance assessments [15]. The Systema evaluation framework has demonstrated that simple baselines like "perturbed mean" (averaging expression across all perturbed cells) often match or exceed the performance of sophisticated models including CPA, GEARS, and scGPT [15].
In this challenging evaluation context, GPerturb's competitive performance using a principled statistical framework with inherent uncertainty quantification represents a significant advantage over both deep learning approaches and simplistic baselines.
To ensure fair comparison across perturbation prediction methods, recent benchmarks have adopted standardized evaluation protocols, typically fine-tuning on a subset of perturbations and assessing prediction error on held-out perturbations across multiple random partitions.
The move toward more rigorous benchmarks like PertEval-scFM [3] [13] and Systema [15] addresses longstanding concerns about overoptimistic evaluations in the field.
A distinctive advantage of GPerturb's Bayesian framework is its native uncertainty quantification. The model provides variance estimates for both basal expression levels and perturbation effects, allowing researchers to separate confident predictions from uncertain ones and to prioritize follow-up experiments accordingly.
This capability is particularly valuable for designing efficient perturbation screens, as it helps focus experimental resources on the most reliable predictions.
Implementing perturbation prediction methods requires specific computational tools and data resources. The following table outlines key components of the GPerturb research toolkit:
Table 3: Essential research reagents and computational tools for perturbation prediction studies
| Resource Category | Specific Examples | Function and Application |
|---|---|---|
| Perturbation Datasets | Adamson (2016), Norman (2019), Replogle (2022) | Benchmark data for training and evaluation [24] [23] [15] |
| Software Frameworks | GPerturb, CPA, GEARS, scGPT | Core prediction algorithms with distinct methodological approaches [24] [23] |
| Evaluation Frameworks | PertEval-scFM, Systema | Standardized benchmarking tools [3] [13] [15] |
| Baseline Methods | Perturbed Mean, Matching Mean | Simple comparators for performance validation [15] |
| Visualization Tools | AUCell, GSEA plots | Biological interpretation of predicted effects [15] |
The integration of perturbation prediction into biological discovery follows a structured workflow that combines computational and experimental approaches:
Figure 2: The iterative closed-loop framework for perturbation discovery, combining computational prediction with experimental validation.
The emerging approach of "closed-loop" perturbation modeling demonstrates how experimental results can be continuously incorporated to refine predictions. Recent work shows that incorporating even small numbers of experimental perturbation examples (10-20) during fine-tuning can dramatically improve prediction accuracy [26]. This iterative approach tripled positive predictive value in T-cell activation studies, from 3% to 9%, while also improving sensitivity and specificity [26].
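The loop structure itself is simple. The following hedged sketch shows the generic shape of such a campaign, with `model_fit`, `predict`, and `run_experiment` as hypothetical user-supplied callables rather than any published API:

```python
import numpy as np

def closed_loop(model_fit, predict, run_experiment, candidates,
                n_rounds=3, batch=10):
    """Generic closed-loop campaign: score candidates, measure the
    top-ranked batch experimentally, refit on everything measured so
    far, and repeat. All three callables are hypothetical stand-ins."""
    measured_x, measured_y = [], []
    model = None  # no model before the first batch of experiments
    for _ in range(n_rounds):
        scores = np.asarray(predict(model, candidates))
        chosen = np.argsort(-scores)[:batch]  # most promising first
        for i in chosen:
            measured_x.append(candidates[i])
            measured_y.append(run_experiment(candidates[i]))
        model = model_fit(np.array(measured_x), np.array(measured_y))
        candidates = np.delete(candidates, chosen, axis=0)
    return model
```

Each round folds a small number of new experimental labels back into training, mirroring the 10-20-example fine-tuning regime described above.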
GPerturb represents a compelling alternative to both complex foundation models and oversimplified baselines in the perturbation prediction landscape. Its Gaussian process framework provides competitive predictive accuracy, interpretable gene-level effects, and native uncertainty quantification.
For researchers and drug development professionals, GPerturb offers a balanced solution that bridges the gap between black-box deep learning models and biologically implausible oversimplifications. As the field moves toward more rigorous evaluation standards that account for systematic biases [15], GPerturb's principled statistical foundation positions it as a valuable tool for therapeutic target discovery and mechanistic studies.
The ongoing development of closed-loop frameworks [26] that incorporate experimental feedback into model refinement points toward a future where computational predictions and experimental validation are tightly integrated, accelerating biological discovery and therapeutic development.
The 'closed-loop' paradigm represents a fundamental shift in scientific methodology, transitioning from traditional linear, open-loop approaches to dynamic, feedback-driven experimentation. In classical open-loop systems, experiments follow a predetermined "stimulate → record response" protocol, treating biological systems as black boxes [27]. In contrast, closed-loop neuroscience and related fields respect the inherent "loopiness" of neural circuits and the fact that the nervous system is embodied and embedded in an environment [27]. This paradigm has become increasingly feasible thanks to advances in real-time processing of large data streams, enabled by improvements in computer processing power, electronics such as microprocessors and field-programmable gate arrays (FPGAs), and specialized software [27].
In the specific context of perturbation effect prediction, this closed-loop approach enables researchers to continuously refine their models based on experimental outcomes, creating an iterative cycle of prediction, experimental validation, and model improvement. This is particularly relevant for single-cell foundation models (scFMs) that aim to predict transcriptional responses to genetic perturbations, where the ultimate goal is to develop models that can accurately forecast the effects of genetic interventions without requiring exhaustive wet-lab experimentation [28] [2].
Recent systematic benchmarking efforts reveal significant limitations in current deep-learning-based approaches for predicting genetic perturbation effects. The PertEval-scFM framework provides a standardized evaluation methodology that assesses whether contextualized representations from single-cell foundation models enhance perturbation effect prediction in a zero-shot setting [28]. Surprisingly, these benchmarks demonstrate that scFM embeddings offer limited improvement over simple baseline models, particularly under distribution shift [28].
A comprehensive study published in Nature Methods compared five foundation models (scGPT, scFoundation, scBERT, Geneformer, UCE) and two other deep learning models (GEARS, CPA) against deliberately simple baselines for predicting transcriptome changes after single or double perturbations [2]. The results were striking: none of the sophisticated deep learning models outperformed the simple baselines, highlighting the importance of critical benchmarking in directing and evaluating method development [2].
Table 1: Performance Comparison of Perturbation Prediction Models on Double Perturbation Tasks
| Model Type | Model Name | Prediction Error (L2 Distance) | Genetic Interaction Prediction Accuracy | Computational Requirements |
|---|---|---|---|---|
| Simple Baselines | Additive Model | Lowest | Limited to additive effects only | Minimal |
| | No Change Model | Moderate | Cannot predict synergistic interactions | Minimal |
| Foundation Models | scGPT | Higher than baselines | Poor, mostly predicts buffering interactions | Very High |
| | scFoundation | Higher than baselines | Poor, limited variation across perturbations | Very High |
| | GEARS | Higher than baselines | Moderate, but less variable than ground truth | High |
| Other Deep Learning | CPA | Not designed for this task | Not applicable | High |
Table 2: Performance on Unseen Perturbation Prediction
| Model Type | Training Data Strategy | Performance on Unseen Perturbations | Consistency Across Cell Lines |
|---|---|---|---|
| Linear Model with Pretrained P | Perturbation data from related cell lines | Best performing | More accurate for similar genes between cell lines |
| Foundation Model Embeddings | Single-cell atlas data | Small benefit over random embeddings | Variable |
| Mean Prediction | None (always predicts average) | Moderate | Consistent but inaccurate |
| GEARS | Gene Ontology annotations | Poor | Not consistent |
| scGPT/scFoundation | Model's pretrained embeddings | Poor to moderate | Variable |
The benchmarking analysis revealed that all deep learning models had substantially higher prediction errors compared to the simple additive baseline when predicting double perturbation effects [2]. The evaluation metric used was the L2 distance between predicted and observed expression values for the 1,000 most highly expressed genes. This finding persisted across different summary statistics, including Pearson delta measure and L2 distances for various gene subsets [2].
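The headline error metric can be sketched as follows; gene selection here uses a supplied mean-expression vector, one reasonable reading of "most highly expressed genes":

```python
import numpy as np

def l2_top_expressed(pred, obs, mean_expr, k=1000):
    """L2 distance between predicted and observed expression profiles,
    restricted to the k genes with the highest mean expression."""
    idx = np.argsort(-np.asarray(mean_expr))[:k]
    diff = np.asarray(pred)[idx] - np.asarray(obs)[idx]
    return float(np.linalg.norm(diff))
```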
For genetic interaction prediction, conceptualized as double perturbation phenotypes that differ surprisingly from additive expectations, none of the models outperformed the 'no change' baseline [2]. The models were particularly deficient in predicting synergistic interactions, with most models predominantly predicting buffering interactions and rarely correctly identifying synergistic relationships [2].
The benchmarking protocols for perturbation effect prediction involve several critical methodological components. For double perturbation assessment, researchers used data where 100 individual genes and 124 pairs of genes were upregulated in K562 cells using a CRISPR activation system [2]. The phenotypes for these 224 perturbations, plus a no-perturbation control, are logarithm-transformed RNA sequencing expression values for 19,264 genes [2].
The standard experimental workflow involves fine-tuning models on all 100 single perturbations and a subset of double perturbations (62 of 124), then assessing prediction error on the remaining held-out double perturbations [2]. For robustness, researchers typically run each analysis multiple times (e.g., five repetitions) using different random partitions of the data [2].
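The split logic described above can be sketched directly (dataset sizes from the Norman benchmark; the exact random partitions used in the study are not reproduced here):

```python
import numpy as np

def double_perturbation_splits(n_singles=100, n_doubles=124,
                               n_train_doubles=62, n_repeats=5, seed=0):
    """Each run trains on all single perturbations plus a random half
    of the doubles, and holds out the remaining doubles; repeated with
    fresh partitions for robustness."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_repeats):
        doubles = rng.permutation(n_doubles)
        train = {"singles": np.arange(n_singles),
                 "doubles": doubles[:n_train_doubles]}
        test = {"doubles": doubles[n_train_doubles:]}
        splits.append((train, test))
    return splits
```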
The key evaluation metrics include the L2 distance between predicted and observed expression values (for example, over the 1,000 most highly expressed genes) and the Pearson delta measure [2].
For unseen perturbation prediction, benchmarks utilize CRISPR interference datasets from multiple cell lines (K562 and RPE1) [2]. The simple linear baseline model represents each read-out gene with a K-dimensional vector and each perturbation with an L-dimensional vector, with these vectors collected in matrices G and P respectively [2]. The model then solves the optimization problem: min‖Ytrain - (GWP^T + b)‖₂², where b is the vector of row means of Ytrain [2].
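Because both factor matrices G and P are fixed, the optimization over W has a closed-form least-squares solution via pseudoinverses; a NumPy sketch (function names are ours):

```python
import numpy as np

def fit_linear_baseline(Y, G, P):
    """Closed-form solution of min_W ||Y - (G W P^T + b)||^2, with b
    the vector of row means of Y (broadcast over perturbations).
    Y: (genes, perturbations); G: (genes, K); P: (perturbations, L)."""
    b = Y.mean(axis=1, keepdims=True)
    Yc = Y - b
    # pinv(G) = (G^T G)^-1 G^T and pinv(P).T = P (P^T P)^-1 for full rank.
    W = np.linalg.pinv(G) @ Yc @ np.linalg.pinv(P).T
    return W, b

def predict_linear_baseline(G, W, P, b):
    return G @ W @ P.T + b
```

When G and P have full column rank, this recovers the unique minimizer W = (GᵀG)⁻¹GᵀY_c P(PᵀP)⁻¹.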
Proper experimental design requires rigorous data quality control procedures before model training and evaluation.
Additionally, for perturbation experiments, it's crucial to verify the integrity of randomization through statistical tests like two-sample independent t-tests for continuous variables and Chi-square tests for categorical variables [29].
Diagram 1: Closed-Loop Experimental Workflow for Perturbation Prediction. This illustrates the iterative process of generating perturbation data, training models, and refining predictions.
Table 3: Essential Research Reagents for Perturbation Experiments
| Reagent/Tool | Specification | Experimental Function |
|---|---|---|
| CRISPR Activation System | As used by Norman et al. (e.g., CRISPRa) | Introduction of targeted genetic perturbations in cell lines |
| K562 Cells | Human immortalized myelogenous leukemia line | Primary model system for perturbation studies |
| RPE1 Cells | Human retinal pigment epithelial cell line | Alternative model system for validation |
| RNA Sequencing Reagents | High-throughput sequencing platforms | Transcriptomic profiling of perturbation effects |
The computational toolkit for perturbation prediction research includes both specialized and general-purpose analytical frameworks:
Specialized Benchmarking Frameworks: PertEval-scFM [28] and Systema [15] for standardized, bias-aware model evaluation.
Data Analysis Environments: single-cell analysis ecosystems such as Scanpy and Seurat for preprocessing and exploratory analysis.
Statistical Analysis Tools: standard packages for the two-sample t-tests and Chi-square tests used to verify randomization [29].
The closed-loop paradigm in neuroscience and perturbation research encompasses several distinct conceptual frameworks, each with specific characteristics and applications:
Diagram 2: Architectures of Open and Closed-Loop Experimental Systems. This compares traditional open-loop approaches with two types of closed-loop systems used in neuroscience and perturbation research.
In the context of perturbation prediction, the "brain-state dynamics loop" corresponds to how models are updated based on newly observed transcriptional states, while the "task dynamics loop" represents the broader experimental context where predictions inform subsequent perturbation designs [32]. This conceptual framework is crucial for understanding how closed-loop systems differ from traditional open-loop approaches where the stimulus protocol is predetermined by the experimenter without regard to the system's current state [32].
The benchmarking results showing that simple linear models can outperform sophisticated foundation models have significant implications for the field of perturbation prediction [28] [2]. These findings underscore the importance of rigorous benchmarking and the need for specialized models and high-quality datasets that capture a broader range of cellular states [28].
Future methodological developments should focus on:
- Rigorous benchmarking against simple baselines [28] [2]
- Specialized models and high-quality datasets that capture a broader range of cellular states [28]
- Evaluation frameworks that control for systematic variation
The closed-loop paradigm represents a promising framework for addressing the current limitations in perturbation effect prediction. By creating iterative cycles of prediction, experimental validation, and model refinement, researchers can gradually improve the accuracy and generalizability of their models, potentially overcoming the current performance plateau where complex models fail to outperform simple baselines [2].
As the field progresses, the integration of more sophisticated closed-loop approaches with increasingly comprehensive experimental data holds the potential to eventually realize the goal of accurate in silico prediction of genetic perturbation effects, which would dramatically accelerate basic research and therapeutic development.
In the field of single-cell biology, accurately predicting how cells respond to genetic perturbations is fundamental to advancing functional genomics, drug discovery, and therapeutic development. However, a formidable challenge confounds these efforts: systematic variation. This term refers to consistent, non-biological differences in gene expression profiles that arise from experimental artifacts, selection biases, or confounding biological factors, rather than from the specific perturbation being studied [15]. These biases can range from stress responses induced during tissue dissociation for single-cell analysis to pervasive confounders like cell cycle distribution shifts [33] [15]. When unaccounted for, systematic variation skews data interpretation, leading to overoptimistic performance claims for prediction models and potentially misleading biological conclusions. This guide objectively evaluates the current landscape of single-cell perturbation effect prediction, highlighting how systematic variation confounds model performance and comparing the capabilities of various computational approaches against simple, yet robust, baselines.
Systematic variation differs fundamentally from random noise. Random error causes unpredictable fluctuations in measurements that tend to cancel out with large sample sizes, primarily affecting precision. In contrast, systematic error introduces consistent, directional bias that skews all measurements away from the true biological state, directly compromising accuracy [34] [35]. In single-cell perturbation studies, this manifests as structured transcriptional changes that are not specific to the intended perturbation.
The table below categorizes common sources of systematic variation in single-cell genomics:
Table: Common Sources of Systematic Variation in Perturbation Studies
| Source Category | Specific Example | Impact on Data |
|---|---|---|
| Experimental Design | Selection of a perturbation panel targeting biologically related genes (e.g., cell cycle genes) [15] | Introduces consistent transcriptomic differences between perturbed and control cells. |
| Sample Preparation | Tissue dissociation protocols triggering cellular stress responses [33] | Induces artificial expression of stress genes (e.g., fos/jun, heat shock genes). |
| Biological Confounders | Underlying biological factors (e.g., cell-cycle phase, chromatin landscape) [15] | Causes widespread shifts, such as cell-cycle arrest in p53-positive cells post-perturbation. |
| Measurement Artifacts | Instrument calibration or consistent operator error [34] | Affects all measurements in a consistent direction or proportion. |
The core problem is that these systematic effects can be biologically real (e.g., a genuine stress response) but are "systematic in effect," meaning they occur broadly across many perturbations and are not specific to the gene being targeted. This obscures the unique, perturbation-specific signal that models aim to predict [15].
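To make this concrete, the toy simulation below (synthetic data with invented dimensions, not any real dataset) shows how a shared, non-specific shift lets a trivial "perturbed mean" baseline score highly on a correlation metric even though it carries no perturbation-specific information:

```python
import numpy as np

rng = np.random.default_rng(0)
n_perts, n_genes = 50, 200

# Shared (systematic) shift common to all perturbations, plus a smaller
# perturbation-specific component -- systematic variation dominates.
systematic = rng.normal(0, 1.0, n_genes)
specific = rng.normal(0, 0.3, (n_perts, n_genes))
true_shifts = systematic + specific          # observed change vs control

# "Perturbed mean" baseline: predict the average shift for every perturbation.
baseline_pred = true_shifts.mean(axis=0)

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

# Correlation of the single baseline vector with each perturbation's true shift.
scores = np.array([pearson(baseline_pred, s) for s in true_shifts])
print(f"mean correlation of perturbed-mean baseline: {scores.mean():.2f}")
```

Because the shared component dwarfs the specific one, the baseline correlates strongly with every perturbation's shift, which is exactly the failure mode the text describes.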
Recent benchmarking studies reveal a sobering reality: state-of-the-art deep-learning models often fail to surpass deliberately simple baselines in predicting transcriptional responses to unseen genetic perturbations [2]. The PertEval-scFM framework and other independent benchmarks have shown that complex foundation models do not provide consistent improvements for this task [3] [15].
The following table summarizes a key benchmark comparing sophisticated models against simple baselines on the task of predicting outcomes for unseen single-gene perturbations:
Table: Benchmarking scFMs vs. Baselines for Unseen Single-Gene Perturbation Prediction
| Model Type | Example Models | Key Finding | Performance Summary vs. Baselines |
|---|---|---|---|
| Single-Cell Foundation Models (scFMs) | scGPT, scFoundation, Geneformer | Struggled to generalize beyond systematic variation [15]. | Did not consistently outperform simple baselines [2]. |
| Other Deep Learning Models | GEARS, CPA | Performance was comparable or inferior to nonparametric baselines [15]. | Matched or outperformed by simple averages. |
| Simple Nonparametric Baselines | "Perturbed Mean", "Matching Mean" | Captured average treatment effects and systematic differences effectively [15]. | Performed comparably or outperformed state-of-the-art methods [15]. |
For the more complex task of predicting double-gene perturbations, the "matching mean" baseline—which simply averages the expression profiles of the two corresponding single-gene perturbations—outperformed all other models, including GEARS, by a considerable margin (11% improvement for the PearsonΔ metric) [15]. Furthermore, a linear model using pretrained embeddings sometimes outperformed the very foundation models from which those embeddings were extracted [2].
Diagram: Performance of complex scFMs is often confounded by systematic variation, while simple baselines capture it effectively.
To address the confounding effect of systematic variation, the Systema framework was introduced. It shifts the evaluation paradigm from simply measuring the similarity between predicted and observed expression profiles to assessing a model's ability to reconstruct the true "perturbation landscape" [15].
The core principles of Systema are:
- Replacing control cells with custom references, such as the centroid of perturbed cells, to better isolate perturbation-specific effects [15]
- Providing an interpretable readout of a model's ability to reconstruct the perturbation landscape [15]
- Assessing biological utility through metrics such as centroid accuracy [15]
When evaluated under this more rigorous framework, the task of generalizing to unseen perturbations proves substantially harder than standard metrics suggest. Systema helps differentiate predictions that merely replicate systematic effects from those that capture biologically informative perturbation responses [15].
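The effect of shifting the reference can be illustrated with a small simulation (synthetic numbers, invented dimensions; a sketch of the idea rather than Systema's actual implementation). A model that only reproduces the shared systematic shift scores well against a control reference but collapses once profiles are re-centered on the perturbed-cell centroid:

```python
import numpy as np

rng = np.random.default_rng(1)
n_perts, n_genes = 40, 100
systematic = rng.normal(0, 1.0, n_genes)            # shared non-specific shift
specific = rng.normal(0, 0.3, (n_perts, n_genes))   # true per-perturbation signal
observed = systematic + specific                     # expression change vs control

centroid = observed.mean(axis=0)                     # centroid of perturbed profiles

# A trivial "model" that only predicts the shared shift for every perturbation.
pred = np.tile(systematic, (n_perts, 1))

def mean_pearson(a, b):
    return float(np.mean([np.corrcoef(x, y)[0, 1] for x, y in zip(a, b)]))

score_vs_control = mean_pearson(pred, observed)                          # control reference
score_vs_centroid = mean_pearson(pred - centroid, observed - centroid)   # centroid reference

print(score_vs_control, score_vs_centroid)
```

Re-centering subtracts the shared component from both predictions and observations, so only perturbation-specific agreement is rewarded.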
A critical source of systematic variation is the cellular stress response triggered during tissue dissociation for single-cell RNA sequencing. The scSLAM-seq (single-cell thiol-linked alkylation for RNA sequencing) protocol was developed to directly measure and correct for this artifact [33].
Experimental Workflow: Cells are exposed to 4-thiouridine (4sU) during dissociation to metabolically label newly transcribed RNA; subsequent iodoacetamide (IAA) alkylation introduces T-to-C conversions, allowing dissociation-induced transcripts to be identified bioinformatically [33].
Diagram: The scSLAM-seq workflow labels and identifies dissociation-induced transcripts.
This methodology has demonstrated that dissociation can induce general stress response genes (e.g., fos/jun) as well as cell-type-specific response programs. It also reveals significant sample-to-sample variation in dissociation response, even under controlled conditions, highlighting a potential source of batch effects [33].
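The bioinformatic core of this approach is counting T-to-C mismatches between each read and the reference. A toy sketch (the sequences and the two-conversion threshold are invented for illustration; real pipelines work on aligned BAM files):

```python
# Toy sketch of scSLAM-seq mismatch calling: 4sU-labeled, IAA-alkylated
# transcripts read out as T-to-C conversions against the reference.
def count_t_to_c(ref: str, read: str) -> int:
    """Count positions where the reference has T but the read has C."""
    return sum(1 for r, q in zip(ref, read) if r == "T" and q == "C")

def is_new_transcript(ref: str, read: str, min_conversions: int = 2) -> bool:
    """Flag a read as newly transcribed (e.g. dissociation-induced) if it
    carries at least `min_conversions` T-to-C conversions."""
    return count_t_to_c(ref, read) >= min_conversions

ref = "ATTGCTTAGT"
labeled_read = "ACCGCTTAGC"     # carries T->C conversions -> flagged as new
unlabeled_read = "ATTGCTTAGT"   # no conversions -> pre-existing transcript

print(is_new_transcript(ref, labeled_read), is_new_transcript(ref, unlabeled_read))
```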
To objectively assess the degree of systematic variation in a given perturbation dataset, a recommended analytical step is to compute the distribution of cosine similarities between each perturbation-specific shift and the average perturbation effect; high similarity indicates shared, non-specific transcriptional shifts [42].
Table: Key Reagent Solutions for Perturbation and Artifact Analysis
| Reagent / Resource | Function | Example Use Case |
|---|---|---|
| 4-thiouridine (4sU) | Ribonucleoside analog for metabolic labeling of newly transcribed RNA. | Labeling transcripts produced during tissue dissociation in scSLAM-seq to identify dissociation artifacts [33]. |
| Iodoacetamide (IAA) | Thiol-reactive alkylating agent. | Chemical conversion of 4sU-labeled RNAs in the scSLAM-seq protocol to introduce T-to-C mutations for bioinformatic identification [33]. |
| CRISPR Activation/Interference Libraries | High-throughput tools for targeted genetic perturbation. | Introducing single or double-gene perturbations in cell lines (e.g., K562, RPE1) to generate benchmark datasets for model evaluation [15] [2]. |
| Systema Framework | Computational evaluation framework (GitHub). | Benchmarking perturbation prediction models while controlling for systematic variation to assess true biological learning [15]. |
| PertEval-scFM | Standardized benchmarking framework. | Evaluating zero-shot capabilities of single-cell foundation models for perturbation effect prediction [3]. |
Effective visualization is crucial for interpreting complex biological networks and identifying potential systematic biases.
The pervasive challenge of systematic variation necessitates a paradigm shift in how we develop and evaluate single-cell perturbation models. Current evidence indicates that sophisticated single-cell foundation models often do not outperform simple baselines that primarily capture these systematic biases, particularly when predicting the effects of unseen perturbations. The path forward requires a concerted focus on robust experimental design, such as using scSLAM-seq to dissect artifacts, and the adoption of rigorous evaluation frameworks like Systema that explicitly control for non-specific effects. The ultimate goal is to build models that genuinely understand perturbation biology, moving beyond the confounding shadows cast by cell cycle, stress responses, and other sources of systematic variation.
In the rigorous field of single-cell perturbation effect prediction, where models forecast how genetic or chemical perturbations alter cellular states, the choice of evaluation metrics is paramount. Root Mean Square Error (RMSE) and a derivative metric, PearsonΔ (the Pearson correlation of predicted versus actual expression changes), have become standard tools for benchmarking model performance. However, a growing body of evidence suggests that an over-reliance on these metrics can paint a misleading picture of a model's true biological predictive power, potentially directing therapeutic discovery down unproductive paths. This guide objectively compares the performance of various computational models, framed within the broader thesis that standard evaluation protocols in single-cell foundation model (scFM) research require a critical reassessment.
To understand their limitations, one must first understand what RMSE and PearsonΔ measure.
Their widespread adoption is driven by intuitive interpretation and standardization across fields [37]. However, they possess critical weaknesses in the context of complex biological perturbation data: they can be gamed by models that merely reproduce systematic variation shared across perturbations, and they reward predictions that replicate average effects rather than perturbation-specific responses [15] [2].
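For reference, both metrics are straightforward to compute. A minimal sketch (toy expression vectors; values are invented):

```python
import numpy as np

def rmse(pred, obs):
    """Root mean square error between predicted and observed expression."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def pearson_delta(pred, obs, control):
    """PearsonDelta: Pearson correlation of predicted vs observed expression
    *changes* relative to a control profile."""
    d_pred = np.asarray(pred, float) - np.asarray(control, float)
    d_obs = np.asarray(obs, float) - np.asarray(control, float)
    return float(np.corrcoef(d_pred, d_obs)[0, 1])

control = np.array([1.0, 2.0, 3.0, 4.0])
observed = np.array([2.0, 1.0, 3.5, 6.0])
predicted = np.array([1.8, 1.2, 3.0, 5.5])

print(rmse(predicted, observed), pearson_delta(predicted, observed, control))
```

Note that PearsonΔ depends on the chosen reference (here, the control profile), which is precisely the lever that frameworks like Systema manipulate.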
Benchmarking studies have revealed the startling performance of simple baseline models against sophisticated single-cell foundation models (scFMs), highlighting the deceptive nature of standard metrics.
The table below summarizes findings from a benchmark across ten single-cell perturbation datasets, comparing state-of-the-art methods against simple baselines. The "Perturbed Mean" baseline predicts the average expression across all perturbed cells, while the "Matching Mean" for combinatorial perturbations averages the profiles of the constituent single-gene perturbations [15].
| Model Type | Model Name | Key Methodology | Reported Performance (PearsonΔ) | Performance Summary vs. Baselines |
|---|---|---|---|---|
| Simple Baseline | Perturbed Mean | Predicts average expression of all perturbed cells [15]. | N/A | Outperformed or matched scFMs on unseen 1-gene perturbations across all datasets [15]. |
| Simple Baseline | Matching Mean | For combo perturbations, averages centroids of constituent genes [15]. | N/A | Outperformed scFMs on unseen 2-gene perturbations by ~11% (PearsonΔ) [15]. |
| scFM | scGPT | Transformer model pre-trained on single-cell data [6]. | High variability across tasks and datasets. | Did not provide consistent improvements; performance highly task-dependent [3] [6]. |
| Specialized Model | GEARS | Leverages graph neural networks and prior knowledge of gene networks [15]. | Comparable to baselines on some datasets [15]. | Outperformed by Matching Mean baseline on combinatorial perturbations [15]. |
| Specialized Model | CPA (Compositional Perturbation Autoencoder) | Uses disentanglement to separate basal cellular state from perturbation effect [40]. | N/A | Performance comparable to simpler baselines [40]. |
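The two baselines in the table are simple enough to state in a few lines of code. A sketch assuming per-perturbation centroid profiles (gene names and values are invented):

```python
import numpy as np

def perturbed_mean(train_profiles):
    """'Perturbed Mean' baseline: predict the average expression profile
    across all perturbed conditions seen in training."""
    return np.mean(list(train_profiles.values()), axis=0)

def matching_mean(single_profiles, gene_a, gene_b):
    """'Matching Mean' baseline for a two-gene perturbation: average the
    centroids of the two constituent single-gene perturbations."""
    return (single_profiles[gene_a] + single_profiles[gene_b]) / 2.0

# Toy centroids (mean expression per single-gene perturbation).
singles = {
    "GENE_A": np.array([1.0, 0.0, 2.0]),
    "GENE_B": np.array([0.0, 2.0, 2.0]),
    "GENE_C": np.array([1.0, 1.0, 1.0]),
}

pm = perturbed_mean(singles)                       # prediction for any unseen 1-gene pert
mm = matching_mean(singles, "GENE_A", "GENE_B")    # prediction for the A+B double pert
print(pm, mm)
```

That models with millions of parameters struggle to beat these few lines is the central empirical point of the benchmark.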
The insights in the table above are primarily derived from a rigorous benchmarking protocol.
The following diagram illustrates the workflow and the central problem of this evaluation paradigm: standard metrics can be gamed by models that learn systematic variation.
Recognizing the limitations of these metrics is the first step. The next is adopting more robust evaluation frameworks and metrics.
Introduced to address these specific issues, Systema is an evaluation framework that emphasizes perturbation-specific effects [15]. Its methodology includes replacing control cells with custom references, such as the centroid of perturbed cells, and scoring predictions with the centroid accuracy metric [15] [42].
Researchers are increasingly turning to a suite of other metrics to gain a more holistic view of model performance, including rank correlation metrics that measure agreement in the ordering of perturbations and distributional metrics that compare predicted and observed expression distributions [40].
For researchers conducting or evaluating perturbation prediction benchmarks, the following tools and datasets are essential.
| Name | Type | Primary Function |
|---|---|---|
| Systema [15] | Evaluation Framework | A framework to evaluate models on perturbation-specific effects, mitigating the influence of systematic variation. |
| PerturBench [40] | Benchmarking Codebase | A modular platform for model development and evaluation across diverse perturbation tasks and datasets. |
| Adamson, Norman, Replogle Datasets [15] | Benchmarking Data | Key public single-cell perturbation screening datasets used for training and evaluating models. |
| Gene Set Enrichment Analysis (GSEA) [15] | Analytical Method | Identifies enriched biological pathways, used to diagnose the presence of systematic variation. |
| Rank Correlation Metrics [40] | Evaluation Metric | Measures the agreement in the ordering of perturbations, crucial for in-silico screening priorities. |
The reliance on RMSE and PearsonΔ as primary metrics for evaluating perturbation prediction models is a precarious practice. As robust benchmarking studies have shown, these metrics can be gamed by systematic biases in the data, leading to the illusion of competence in models that are merely recapitulating background noise. This misdirection can have real-world consequences, wasting computational and experimental resources on models that fail to generalize. The path forward requires a shift towards more sophisticated, biology-aware evaluation frameworks like Systema and a commitment to multi-metric assessments that include rank-based and distributional metrics. By looking beyond standard metrics, the field can better select models that truly unravel the complexities of cellular perturbation biology.
Understanding how genetic perturbations affect single cells is crucial for advancing functional genomics, with wide-ranging implications for revealing gene functions, mapping regulatory networks, and accelerating therapeutic discovery [15]. The space of possible genetic perturbations is combinatorially complex, making exhaustive experimental exploration infeasible. To address this challenge, computational approaches have been developed to predict transcriptional outcomes of genetic perturbations that were never experimentally tested [15]. However, despite strong performance reported for these methods, their ability to infer the effects of truly novel perturbations remains an open question in the field [41].
Recent studies have revealed a critical methodological concern: current evaluation approaches may overestimate model performance due to systematic variation—consistent transcriptional differences between perturbed and control cells arising from selection biases or biological confounders [41] [15]. This systematic variation can lead to misleading conclusions about a model's true predictive capabilities for unseen perturbations. The Systema framework, introduced in a 2025 Nature Biotechnology publication, addresses this fundamental challenge by providing a more rigorous standard for evaluating perturbation response prediction methods [41] [15].
Systema is an evaluation framework specifically designed to emphasize perturbation-specific effects and identify predictions that correctly reconstruct the perturbation landscape [41]. It was developed in response to the finding that existing metrics are susceptible to systematic biases, which can lead to overestimated performance [15]. The framework moves beyond traditional reference-based evaluation approaches that use control cells as the sole point of comparison [42].
The core innovation of Systema is its ability to mitigate systematic biases by focusing on perturbation-specific effects while providing an interpretable readout of methods' ability to reconstruct the perturbation landscape [42]. Instead of using control cells as a point of reference, Systema enables the use of custom references that better isolate perturbation-specific effects, including the centroid of perturbed cells [42]. This approach results in substantially lower but more realistic evaluation scores that better reflect true generalization capability to novel perturbations [42].
A key contribution leading to Systema's development was the quantification of systematic variation across perturbation datasets. Researchers computed the distribution of cosine similarities between perturbation-specific shifts and the average perturbation effect [42]. High cosine similarity indicates that transcriptional responses to different perturbations are aligned in a similar direction, suggesting shared, possibly non-specific shifts in gene expression [42].
This analysis revealed that the amount of systematic variation in perturbation datasets strongly correlated with the performance scores of existing perturbation response prediction methods [42]. In essence, models were achieving high scores primarily by capturing these systematic differences rather than genuine perturbation-specific effects.
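The quantification described above can be sketched directly (synthetic shifts with invented dimensions; a simplified version of the published analysis):

```python
import numpy as np

def systematic_variation_score(shifts):
    """Cosine similarity between each perturbation-specific shift and the
    average shift across all perturbations; high values suggest shared,
    non-specific transcriptional responses."""
    shifts = np.asarray(shifts, float)
    avg = shifts.mean(axis=0)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return np.array([cos(s, avg) for s in shifts])

rng = np.random.default_rng(2)
shared = rng.normal(0, 1, 300)                     # dominant shared component
shifts = shared + rng.normal(0, 0.2, (30, 300))    # strongly aligned shifts

sims = systematic_variation_score(shifts)
print(f"median cosine similarity: {np.median(sims):.2f}")
```

In a dataset dominated by systematic variation, this distribution concentrates near 1, which is exactly the regime in which mean-style baselines score well.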
The Systema benchmark was conducted across ten single-cell perturbation datasets collected from six different sources, spanning three distinct technologies and five different cell lines [15]. The datasets included varying numbers of perturbations, including a genome-wide perturbation screen and a dataset with combinatorial two-gene perturbations [15]. This diversity ensured comprehensive evaluation across different experimental conditions and perturbation types.
The benchmark compared established perturbation response methods against simple non-parametric baselines. The state-of-the-art methods included GEARS, scGPT, and CPA [15].
The simple baselines designed for comparison were the "perturbed mean," which predicts the average expression across all perturbed cells, and the "matching mean," which for combinatorial perturbations averages the profiles of the constituent single-gene perturbations [15].
Table 1: Performance Comparison on Unseen One-Gene Perturbations
| Method | PearsonΔ (Adamson) | PearsonΔ (Norman) | PearsonΔ (Replogle) | PearsonΔ20 (Frangieh) |
|---|---|---|---|---|
| Perturbed Mean | Highest | Highest | Highest | Comparable |
| GEARS | Lower | Lower | Lower | Comparable |
| scGPT | Lower | Lower | Lower | Highest |
| CPA | Lower | Lower | Lower | Lower |
Table 2: Performance on Unseen Two-Gene Perturbations (Norman Dataset)
| Method | PearsonΔ | Relative Improvement over Best Alternative |
|---|---|---|
| Matching Mean | Highest | 11% improvement over GEARS |
| GEARS | Lower | Baseline |
| scGPT | Lower | - |
| CPA | Lower | - |
The benchmark results revealed several critical insights that challenged conventional understanding in the field:
Simple baselines performed comparably or superior to state-of-the-art methods across different datasets and evaluation metrics [15] [42]. For unseen one-gene perturbations, the perturbed mean baseline outperformed other methods across all datasets using the PearsonΔ score [15].
For combinatorial perturbations, the matching mean baseline outperformed all other methods by a considerable margin, with relative improvements of 11% for PearsonΔ over the best alternative method (GEARS) [15].
The predicted differential expression profiles across all methods were similar to each other and correlated with those of the perturbed mean, suggesting that perturbation response prediction methods predominantly capture systematic differences rather than perturbation-specific effects [15].
Further investigation revealed specific sources of systematic variation that explained the strong performance of simple baselines:
Systematic Variation Sources: This diagram illustrates how different factors contribute to systematic variation in perturbation datasets, ultimately leading to overestimated model performance.
Systema introduces a fundamentally different approach to evaluation compared to traditional metrics. The key methodological innovation is the replacement of control cells as the reference point with more appropriate references that better isolate perturbation-specific effects [42]. This approach includes using the centroid of perturbed cells as the reference point and evaluating predictions with the centroid accuracy metric [42].
Application of Systema with these modified references resulted in substantially lower evaluation scores across all methods, demonstrating that generalizing to unseen genetic perturbations is substantially more challenging than traditional metrics suggest [42].
A key innovation in Systema is the centroid accuracy metric, which provides a more biologically meaningful assessment of prediction quality [42]. A centroid accuracy of 1 indicates that inferred profiles perfectly recover the expected transcriptional effects of a perturbation [42]. When this metric was applied to evaluate predictions on unseen one-gene perturbations across ten datasets, the average perturbation scores barely exceeded those of the perturbed mean baseline [42].
To further evaluate biological utility, the centroid accuracy was extended to test whether predicted centroids could distinguish coarse-grained perturbation effects [42]. In one application, researchers used inferred centroids to classify unseen perturbations as inducing either low or high chromosomal instability (CIN) in the genome-wide K562 perturbation screen [42]. Among all methods, only the finetuned version of scGPT achieved a ROC-AUC substantially above chance (AUC=0.7) [42].
Centroid Accuracy Evaluation: This diagram illustrates Systema's centroid accuracy metric, which measures whether predicted profiles are closer to their correct ground-truth centroid than to other perturbation centroids.
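A minimal sketch of the centroid-accuracy idea, using Euclidean distance and invented perturbation names (Systema's actual implementation may differ in details such as the distance measure):

```python
import numpy as np

def centroid_accuracy(preds, centroids):
    """Fraction of predictions that lie closer (Euclidean) to their own
    ground-truth perturbation centroid than to any other centroid.
    `preds` and `centroids` map perturbation name -> expression profile."""
    names = list(centroids)
    C = np.stack([centroids[n] for n in names])
    hits = 0
    for name, p in preds.items():
        dists = np.linalg.norm(C - p, axis=1)
        hits += names[int(np.argmin(dists))] == name
    return hits / len(preds)

centroids = {"pertA": np.array([1.0, 0.0]),
             "pertB": np.array([0.0, 1.0]),
             "pertC": np.array([1.0, 1.0])}
preds = {"pertA": np.array([0.9, 0.1]),   # nearest to pertA -> hit
         "pertB": np.array([0.2, 0.8]),   # nearest to pertB -> hit
         "pertC": np.array([0.0, 0.2])}   # nearest to pertB -> miss

acc = centroid_accuracy(preds, centroids)
print(acc)
```

A score of 1 corresponds to every predicted profile recovering its perturbation's expected transcriptional effect; a mean-style prediction, being identical for all perturbations, cannot exceed chance.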
Table 3: Key Research Reagents and Computational Tools for Perturbation Studies
| Resource Name | Type | Function/Application | Reference |
|---|---|---|---|
| Adamson Dataset | Experimental Data | Investigates endoplasmic reticulum homeostasis through targeted perturbations | [15] |
| Norman Dataset | Experimental Data | Examines cell cycle and growth processes via combinatorial perturbations | [15] |
| Replogle Dataset | Experimental Data | Genome-wide perturbation screen in RPE1 and K562 cell lines | [15] [42] |
| Frangieh Dataset | Experimental Data | Multi-modal pooled Perturb-CITE-seq screens in patient models | [42] |
| GEARS Codebase | Computational Tool | Data processing and model implementation framework | [42] |
| scGPT | Computational Model | Foundation model for single-cell multi-omics using generative AI | [15] |
| CPA | Computational Model | Compositional Perturbation Autoencoder for perturbation modeling | [15] |
| Systema GitHub | Evaluation Framework | Implementation of Systema evaluation framework | [15] |
The introduction of Systema has profound implications for perturbation effect prediction research, particularly in the evaluation of single-cell foundation models (scFMs). By revealing that current methods struggle to generalize beyond systematic variation, Systema challenges the field to develop more robust approaches that capture genuine biological effects rather than leveraging dataset-specific biases [41] [15].
Looking forward, the developers of Systema suggest that perturbation response models should be evaluated based on their biological utility—how inferred perturbation profiles help answer downstream queries about relevant cellular phenotypes [42]. Framing evaluation in terms of downstream tasks may offer a more meaningful and practical perspective than traditional metrics [42]. Emerging perturbation platforms like optical pooled screens and spatial functional genomics screens, which combine perturbation data with cell morphology, spatial context, and tissue-level features, present particularly rich opportunities for this type of evaluation [42].
For researchers and drug development professionals, Systema offers a more rigorous framework for validating computational models before their application in therapeutic discovery pipelines. By ensuring that models capture genuine perturbation-specific effects rather than systematic biases, Systema can help increase confidence in computational predictions and accelerate the translation of perturbation insights into clinical applications.
Systema represents a paradigm shift in how the field evaluates genetic perturbation response prediction methods. By moving beyond traditional metrics that are susceptible to systematic variation and introducing novel evaluation approaches that emphasize perturbation-specific effects, Systema provides a more rigorous and biologically meaningful standard for assessment. The framework's demonstration that simple baselines often outperform complex models on traditional metrics underscores the critical importance of proper evaluation methodology in advancing the field. As perturbation modeling continues to play an increasingly important role in functional genomics and therapeutic discovery, Systema offers an essential toolkit for ensuring that computational methods genuinely advance our understanding of biological systems rather than merely capturing dataset-specific biases.
In the rapidly evolving field of single-cell biology, the development of single-cell foundation models (scFMs) represents a transformative advance with profound implications for understanding cellular processes and disease mechanisms. These models, trained on millions of single-cell transcriptomes, promise to learn fundamental biological principles that generalize across diverse cell types, states, and conditions [14]. Within this context, perturbation effect prediction—the ability to accurately forecast how genetic manipulations alter cellular states—has emerged as a critical benchmark for scFM capability. However, recent rigorous evaluations reveal a sobering reality: despite significant computational investment, current scFMs frequently fail to outperform deliberately simple baselines on this crucial task [2] [3]. This performance gap highlights the urgent need to systematically evaluate the core optimization levers that govern scFM efficacy: data quality, pretraining strategies, and targeted fine-tuning methodologies.
The benchmarking evidence is striking. Multiple independent studies have demonstrated that scFM embeddings provide no consistent improvement over simpler approaches, particularly when predicting strong or atypical perturbation effects or operating under distribution shift [2] [3] [25]. These findings underscore fundamental limitations in current optimization approaches and necessitate a critical examination of how data quality, pretraining architectures, and fine-tuning protocols can be leveraged to enhance model performance. This review synthesizes current benchmarking results, analyzes the experimental methodologies underlying these findings, and identifies the most promising optimization pathways for advancing perturbation prediction capabilities in scFMs.
Recent systematic benchmarking efforts have yielded consistent findings across multiple studies and model architectures, revealing significant performance gaps in perturbation prediction tasks. The following comparative analysis synthesizes quantitative results from these evaluations.
Table 1: Performance Comparison in Double Perturbation Prediction (Norman et al. Dataset)
| Model Type | Model Name | Prediction Error (L2 Distance) | Genetic Interaction Detection | Key Limitations |
|---|---|---|---|---|
| Simple Baselines | Additive Model | Lowest | Cannot predict interactions | Serves as performance reference |
| Simple Baselines | No Change Model | Moderate | Poor performance | Predicts no change from control |
| Single-cell Foundation Models | scGPT | Higher than baseline | No improvement over "no change" | Predictions show minimal variation across perturbations |
| Single-cell Foundation Models | scFoundation | Higher than baseline | Limited capability | Requires specific gene sets; less flexible |
| Single-cell Foundation Models | Geneformer | Higher than baseline | Limited capability | Struggles with strong effect predictions |
| Single-cell Foundation Models | scBERT | Higher than baseline | No improvement over "no change" | Predictions show minimal variation |
| Single-cell Foundation Models | UCE | Higher than baseline | No improvement over "no change" | Predictions show minimal variation |
| Other Deep Learning Models | GEARS | Higher than baseline | Mostly predicts buffering interactions | Limited variation in predictions |
| Other Deep Learning Models | CPA | Highest | Not designed for unseen perturbations | Uncompetitive in this benchmark |
Table 2: Performance in Unseen Perturbation Prediction (Replogle et al. Datasets)
| Model/Approach | Prediction Accuracy | Data Efficiency | Computational Cost |
|---|---|---|---|
| Mean Prediction Baseline | Moderate | High | Low |
| Linear Model with Training Data Embeddings | Competitive with scFMs | High | Low |
| scGPT with In-built Decoder | Lower than or equal to baselines | Low | High |
| GEARS with In-built Decoder | Lower than or equal to baselines | Low | High |
| Linear Model with scGPT Embeddings | Similar to scGPT itself | Moderate | Moderate |
| Linear Model with Perturbation Pretraining | Highest | High | Moderate |
The consistent pattern across these benchmarks indicates that current scFMs, despite their architectural complexity and extensive pretraining, fail to demonstrate superior performance in perturbation prediction compared to simpler, more direct approaches [2]. This performance gap is particularly evident in challenging scenarios such as predicting genetic interactions or extrapolating to unseen perturbations. Notably, the "additive model," which simply sums the individual logarithmic fold changes of single perturbations to predict double perturbation effects, sets a surprisingly high performance bar that current deep learning models have not consistently surpassed [2].
The limitations extend beyond quantitative metrics to qualitative shortcomings. Most models predominantly predict "buffering" interactions (where the double perturbation effect is less than expected) and rarely correctly identify "synergistic" interactions (where the combined effect is greater than expected) [2]. Furthermore, model predictions often show insufficient variation across different perturbations, suggesting a failure to capture the specific biological consequences of distinct genetic manipulations [2].
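The additive baseline referenced throughout this section is a one-line computation. A sketch with invented log-fold-change values:

```python
import numpy as np

def additive_prediction(lfc_a, lfc_b):
    """Additive baseline for a double perturbation: sum the log fold
    changes of the two constituent single perturbations."""
    return np.asarray(lfc_a, float) + np.asarray(lfc_b, float)

lfc_a = np.array([1.0, -0.5, 0.0])   # toy log fold changes for perturbation A
lfc_b = np.array([0.5, 0.5, -1.0])   # toy log fold changes for perturbation B

pred_ab = additive_prediction(lfc_a, lfc_b)
print(pred_ab)
```

By construction this baseline predicts no genetic interaction at all, which is what makes it such a revealing reference: beating it requires a model to capture genuinely non-additive biology.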
The benchmarking studies employed rigorous experimental designs to ensure fair and informative comparisons between scFMs and baseline approaches. Understanding these methodologies is crucial for interpreting the results and designing future optimization strategies.
The evaluation of double perturbation prediction followed a standardized protocol using data from Norman et al., which measured transcriptional responses to 100 individual gene perturbations and 124 paired gene perturbations in K562 cells using CRISPR activation technology [2]. The experimental workflow included:
Data Partitioning: Models were fine-tuned on all 100 single perturbations and a randomly selected subset of 62 double perturbations (50%), with the remaining 62 double perturbations held out for testing. This process was repeated across five different random partitions to ensure robustness.
Evaluation Metrics: Primary evaluation employed the L2 distance between predicted and observed expression values for the 1,000 most highly expressed genes. Additional validation used the Pearson delta measure and L2 distances computed over other gene subsets (the most highly expressed or most differentially expressed genes).
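Both metrics are straightforward to compute. The sketch below assumes expression arrays share a common gene order and that the top-gene ranking is taken from mean control expression (an assumption; the exact ranking convention is not restated here):

```python
import numpy as np

def l2_top_genes(pred, obs, control, n_top=1000):
    """L2 distance between predicted and observed expression vectors,
    restricted to the n_top most highly expressed genes (ranked by
    mean expression across control cells)."""
    top = np.argsort(control.mean(axis=0))[::-1][:n_top]
    return float(np.linalg.norm(pred[top] - obs[top]))

def pearson_delta(pred, obs, control_mean):
    """Pearson correlation of expression *changes* relative to control
    (the 'Pearson delta' measure)."""
    return float(np.corrcoef(pred - control_mean, obs - control_mean)[0, 1])
```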
Genetic Interaction Analysis: Genetic interactions were operationally defined as double perturbation phenotypes that differed from additive expectations beyond what would be expected under a normal distribution null model. At a 5% false discovery rate, 5,035 significant genetic interactions were identified from the complete dataset.
Interaction Classification: Predictions were categorized into interaction types: "buffering" (weaker than additive effect), "synergistic" (stronger than additive effect), or "opposite" (qualitatively different effect).
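The classification step above can be sketched as a simple comparison of the observed double-perturbation effect against the additive expectation. The tolerance band here is purely illustrative; the benchmark itself uses a statistical null model rather than a fixed threshold:

```python
import numpy as np

def classify_interaction(double_effect, additive_effect, tol=0.1):
    """Classify a genetic interaction by comparing a scalar
    double-perturbation effect to its additive expectation.
    Illustrative thresholds only, not the benchmark's test."""
    if np.sign(double_effect) != np.sign(additive_effect):
        return "opposite"     # qualitatively different effect
    if abs(double_effect) < abs(additive_effect) * (1 - tol):
        return "buffering"    # weaker than additive
    if abs(double_effect) > abs(additive_effect) * (1 + tol):
        return "synergistic"  # stronger than additive
    return "additive"
```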
The assessment of model capability to generalize to completely novel perturbations employed data from Replogle et al. (K562 and RPE1 cell lines) and Adamson et al. (K562 cells) [2]. The methodology included:
Baseline Construction: A simple linear model framework was developed where read-out genes are represented by K-dimensional vectors and perturbations by L-dimensional vectors, with the model solving:
argmin_W || Y_train - (G W P^T + b) ||_2^2
where Y_train contains gene expression values, G represents gene embeddings, P represents perturbation embeddings, and b is the vector of row means [2].
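With G and P held fixed, this objective admits a closed-form least-squares solution via pseudoinverses. The sketch below uses simulated embeddings (a real analysis would substitute matrices extracted from scFoundation, scGPT, or GEARS) and treats b as a fixed per-gene offset:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_perts, K, L = 50, 20, 5, 4

G = rng.normal(size=(n_genes, K))   # gene embeddings (rows: read-out genes)
P = rng.normal(size=(n_perts, L))   # perturbation embeddings
W_true = rng.normal(size=(K, L))
b = rng.normal(size=(n_genes, 1))   # per-gene offset (row means in the paper)
Y_train = G @ W_true @ P.T + b      # noiseless synthetic expression matrix

# Closed-form least-squares solution for W with G, P, and b fixed:
# W = G^+ (Y - b) (P^T)^+
W_hat = np.linalg.pinv(G) @ (Y_train - b) @ np.linalg.pinv(P.T)
```

On noiseless full-rank data this recovers W exactly; the point of the construction is that all learned capacity lives in the small K x L matrix W, so any performance gain over random embeddings must come from the pretrained G and P.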
Embedding Extraction: Gene embedding matrices were extracted from scFoundation and scGPT, while perturbation embedding matrices were extracted from GEARS to test whether pretrained representations contained valuable biological knowledge.
Cross-Cell Line Validation: Models were pretrained on one cell line (e.g., K562) and evaluated on another (e.g., RPE1) to assess generalization capability across biological contexts.
A more recent approach introduced "closed-loop" fine-tuning that incorporates experimental perturbation data directly into the model refinement process [26]. This methodology includes:
Model Fine-Tuning: The base scFM (Geneformer-30M-12L) is first fine-tuned to classify cellular states (e.g., activated vs. resting T-cells) using relevant single-cell RNA sequencing data.
Perturbation Data Integration: The model is further fine-tuned with single-cell RNA sequencing data from CRISPR activation/interference screens (Perturb-seq), labeled with cellular activation status but not specific perturbation identities.
Iterative Refinement: Model performance is evaluated with incrementally increasing perturbation examples to determine the minimal data required for substantial improvement.
Therapeutic Application: The optimized model is applied to disease contexts (e.g., RUNX1-familial platelet disorder) to identify potential therapeutic targets through in silico perturbation screening.
The benchmarking results provide critical insights into the relative importance of different optimization approaches for enhancing scFM performance in perturbation prediction.
The composition and quality of training data emerge as fundamental determinants of scFM performance. Current scFMs are typically pretrained on large-scale single-cell atlases containing tens of millions of cells spanning diverse tissues and conditions [14]. While this approach captures broad biological variability, it appears insufficient for excelling at perturbation prediction. Several key findings highlight this limitation:
Perturbation-Specific Data Trumps Scale: Linear models using embeddings pretrained on perturbation data consistently outperformed those using scFM embeddings pretrained on broader single-cell atlas data [2]. This suggests that data relevance may be more important than dataset size for this specific task.
Distribution Shift Vulnerability: scFM embeddings show particularly poor performance under distribution shift, indicating that models trained on "normal" cellular states struggle to generalize to strongly perturbed conditions [3].
Data Quality Challenges: Single-cell data suffers from batch effects, technical noise, and variable processing steps that can introduce confounding patterns [14]. While some models demonstrate robustness to these artifacts, data quality issues likely contribute to the performance limitations in perturbation prediction.
Current scFMs predominantly employ transformer architectures, adapting either BERT-like encoder designs or GPT-inspired decoder frameworks [14]. However, benchmarking results suggest that architectural sophistication alone does not guarantee performance advantages for perturbation prediction:
Tokenization Challenges: Unlike natural language, gene expression data lacks natural sequential ordering, requiring artificial tokenization strategies such as ranking genes by expression level or binning expression values [14]. These arbitrary orderings may obscure biologically meaningful gene-gene relationships crucial for accurate perturbation prediction.
Attention Mechanism Limitations: While self-attention layers theoretically enable models to learn complex gene regulatory relationships, current implementations appear insufficient for capturing the intricate biological logic underlying genetic interactions [2].
Embedding Utility: The fact that linear models using extracted scFM embeddings perform similarly to the full models suggests that these embeddings may not capture the specialized information needed for perturbation prediction [2].
Fine-tuning strategies represent the most promising optimization lever, with "closed-loop" approaches demonstrating significant performance improvements:
Closed-Loop Advantage: Incorporating experimental perturbation data during fine-tuning dramatically improves prediction accuracy, increasing positive predictive value from 3% to 9% in T-cell activation models while also enhancing sensitivity (76% vs. 48%) and specificity (81% vs. 60%) [26].
Data Efficiency: Performance improvements saturate with approximately 20 perturbation examples, suggesting that even modest experimental investments can substantially enhance model accuracy [26].
Task-Specific Adaptation: Fine-tuning protocols that explicitly incorporate perturbation effects alongside cellular state information enable models to better capture the causal relationships between genetic manipulations and phenotypic outcomes [26].
Table 3: Impact of Closed-Loop Fine-Tuning on Prediction Metrics (T-cell Activation Model)
| Evaluation Metric | Open-Loop ISP | Closed-Loop ISP | Relative Improvement |
|---|---|---|---|
| Positive Predictive Value (PPV) | 3% | 9% | 3x |
| Negative Predictive Value (NPV) | 98% | 99% | 1% |
| Sensitivity | 48% | 76% | 58% |
| Specificity | 60% | 81% | 35% |
| AUROC | 0.63 | 0.86 | 37% |
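The screening metrics in Table 3 all derive from a standard confusion matrix. A small sketch with toy counts (not the study's data):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard screening metrics from confusion-matrix counts."""
    return {
        "PPV": tp / (tp + fp),           # positive predictive value
        "NPV": tn / (tn + fn),           # negative predictive value
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy counts for illustration only
m = confusion_metrics(tp=19, fp=81, tn=810, fn=6)
# m["PPV"] == 0.19, m["sensitivity"] == 0.76
```

Note the asymmetry visible in Table 3: in a screen where true hits are rare, NPV stays near 1 regardless of model quality, so PPV and sensitivity are the more informative numbers.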
Successful implementation of scFM optimization requires specific computational resources, datasets, and analytical tools. The following table details key components of the experimental pipeline for perturbation prediction studies.
Table 4: Essential Research Reagents and Computational Resources
| Resource Type | Specific Examples | Function and Application |
|---|---|---|
| Single-cell Foundation Models | Geneformer-30M-12L, scGPT, scBERT, UCE, scFoundation | Base models for fine-tuning and perturbation prediction |
| Benchmark Datasets | Norman et al. (CRISPRa in K562), Replogle et al. (CRISPRi in K562/RPE1), Adamson et al. (K562 perturbations) | Standardized data for model training and evaluation |
| Evaluation Frameworks | PertEval-scFM | Standardized benchmarking of perturbation prediction performance |
| Computational Libraries | TensorFlow, PyTorch, Optuna, Ray Tune | Model implementation, training, and hyperparameter optimization |
| Specialized Toolkits | Intel OpenVINO, ONNX Runtime | Model optimization and acceleration for efficient inference |
| Biological Databases | CZ CELLxGENE, Human Cell Atlas, PanglaoDB, GEO/SRA | Sources of diverse single-cell data for pretraining and fine-tuning |
The comprehensive benchmarking of single-cell foundation models for perturbation effect prediction reveals a critical performance gap between current model capabilities and practical biological applications. Despite their architectural sophistication and extensive pretraining, scFMs consistently fail to outperform simpler, more direct approaches for predicting transcriptional responses to genetic perturbations. This limitation underscores the need for more targeted optimization strategies that prioritize data quality, biological relevance, and specialized fine-tuning over sheer model scale and diversity of pretraining data.
The most promising path forward centers on "closed-loop" optimization frameworks that iteratively integrate experimental perturbation data into model refinement. This approach, which increases positive predictive value three-fold with relatively modest data requirements, demonstrates the power of combining foundational pretraining with targeted, task-specific fine-tuning. Future advances will likely depend on developing specialized architectures specifically designed for capturing causal relationships in biological systems, coupled with higher-quality perturbation datasets that more comprehensively probe genetic interactions across diverse cellular contexts.
For researchers and drug development professionals, these findings suggest a pragmatic approach to leveraging scFMs for perturbation prediction. Currently, simpler baseline models provide competitive performance with significantly lower computational costs. However, as optimization methodologies mature—particularly through improved data quality, specialized pretraining, and targeted fine-tuning—scFMs hold immense potential to eventually deliver on their promise as virtual cells for in silico therapeutic discovery and biological mechanism elucidation.
The ability to accurately predict transcriptional outcomes of genetic perturbations is a central challenge in functional genomics, with profound implications for understanding gene function, mapping regulatory networks, and accelerating therapeutic discovery [15]. Single-cell RNA sequencing technologies, particularly high-throughput perturbation screens like Perturb-seq, have generated vast amounts of data on cellular responses to genetic interventions [43]. In response, numerous computational methods—especially single-cell foundation models (scFMs) and other deep learning approaches—have been developed to predict effects of both single-gene and combinatorial perturbations, with the ultimate goal of generalizing to entirely unseen genetic interventions [15] [2].
This comparison guide provides an objective performance evaluation of state-of-the-art perturbation prediction methods, with particular focus on their capabilities for predicting double-gene perturbation effects and generalizing to unseen perturbations. We synthesize recent benchmarking studies and experimental validations to offer researchers, scientists, and drug development professionals a clear assessment of the current landscape, methodological considerations, and practical performance expectations for these tools.
Table 1: Performance comparison of methods predicting double-gene perturbation effects on the Norman et al. (2019) K562 CRISPRa dataset. Prediction error is measured as L2 distance between predicted and observed expression values for the top 1,000 highly expressed genes [2].
| Method | Type | Prediction Error (L2) | Key Characteristics |
|---|---|---|---|
| Additive Model | Simple Baseline | Lowest | Sum of individual logarithmic fold changes [2] |
| No Change Model | Simple Baseline | Intermediate | Predicts same expression as control condition [2] |
| GEARS | Deep Learning | Higher than baselines | Integrates knowledge graphs [2] |
| scGPT | Foundation Model | Higher than baselines | Pretrained on single-cell data [2] |
| CPA | Deep Learning | Highest | Not designed for unseen perturbations [2] |
Table 2: Performance comparison on predicting effects of completely unseen single-gene perturbations across multiple datasets (Adamson et al., Replogle et al.). Performance metrics include Pearson correlation between predicted and actual expression changes [15] [2].
| Method | Adamson Dataset | Replogle K562 | Replogle RPE1 | Key Characteristics |
|---|---|---|---|---|
| Perturbed Mean | Best | Best | Best | Average expression across all perturbed cells [15] |
| Linear Model with Pretrained P | Competitive | Competitive | Competitive | Embeddings pretrained on perturbation data [2] |
| scGPT | Intermediate | Intermediate | Intermediate | Foundation model with biological pretraining [2] |
| GEARS | Intermediate | Intermediate | Intermediate | Uses Gene Ontology annotations [2] |
| Matching Mean | Not Applicable | Not Applicable | Not Applicable | For combinatorial perturbations only [15] |
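The two nonparametric baselines in Table 2 are trivial to implement. This sketch follows our reading of their descriptions; in particular, the matching rule for the matching mean (share at least one gene with the target pair) is an assumption:

```python
import numpy as np

def perturbed_mean(train_profiles):
    """Predict, for any unseen perturbation, the average expression
    profile over all perturbed training cells."""
    return np.mean(train_profiles, axis=0)

def matching_mean(train_profiles, train_labels, target_pair):
    """For a combinatorial perturbation, average over training profiles
    whose perturbation shares at least one gene with the target pair
    (our reading of the baseline's description)."""
    mask = np.array([bool(set(lbl) & set(target_pair)) for lbl in train_labels])
    return np.mean(train_profiles[mask], axis=0)
```

That a constant predictor like `perturbed_mean` tops Table 2 is itself the headline result: the metrics are dominated by the shared control-vs-perturbed shift rather than perturbation-specific signal.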
The standard evaluation protocol for double perturbation prediction utilizes the Norman et al. (2019) dataset, which contains CRISPR activation (CRISPRa) perturbations of 100 individual genes and 124 gene pairs in K562 cells, with single-cell RNA sequencing measurements of 19,264 genes [2]. The established methodology involves:
Data Partitioning: Models are fine-tuned on all 100 single perturbations and a subset of 62 double perturbations, with the remaining 62 double perturbations held out for testing. For robustness, analyses are typically repeated across five random partitions [2].
Evaluation Metrics: Multiple metrics are employed, including the L2 distance between predicted and observed expression over top-expressed gene subsets and the Pearson delta measure [2].
Genetic Interaction Analysis: Methods are evaluated on their ability to predict genetic interactions, defined as double perturbation phenotypes that significantly deviate from additive expectations. Performance is measured using true-positive rate and false discovery proportion curves across threshold values [2].
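The true-positive rate and false discovery proportion curves can be traced by sweeping a confidence threshold over ranked interaction calls. A minimal sketch:

```python
import numpy as np

def tpr_fdp_curve(scores, is_true_interaction):
    """Rank calls by confidence (highest first) and report, at each cut
    of the ranked list, the true-positive rate and the false discovery
    proportion among calls made so far."""
    order = np.argsort(scores)[::-1]
    labels = np.asarray(is_true_interaction)[order]
    tp = np.cumsum(labels)
    n_calls = np.arange(1, len(labels) + 1)
    tpr = tp / max(labels.sum(), 1)   # recall of true interactions
    fdp = (n_calls - tp) / n_calls    # fraction of calls that are false
    return tpr, fdp
```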
For assessing generalization to entirely unseen perturbations, the Systema framework has been developed to address limitations of standard evaluation metrics [15]. Key methodological components include:
Dataset Selection: Evaluation across ten single-cell perturbation datasets spanning three technologies (CRISPRi, CRISPRa, Perturb-seq) and five cell lines (K562, RPE1, etc.), including genome-wide screens [15].
Systematic Variation Control: Quantification and adjustment for systematic differences between perturbed and control cells caused by selection biases or biological confounders, which can artificially inflate performance metrics [15].
Perturbation-Specific Focus: Emphasis on evaluating the prediction of perturbation-specific effects rather than systematic variation patterns [15].
Perturbation Prediction Evaluation Workflow - This diagram illustrates the comprehensive benchmarking methodology for evaluating perturbation prediction methods, highlighting the parallel assessment of double perturbation prediction and unseen perturbation generalization capabilities.
Systematic Variation Impact on Evaluation - This diagram outlines how systematic variation in perturbation datasets affects performance evaluation, explaining why simple baselines can outperform complex models and highlighting specific examples from benchmark datasets.
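One way to make the systematic-variation issue concrete: score a prediction only on what remains after removing the shift shared by all perturbed cells, so that merely reproducing that shift earns no credit. This is an illustrative sketch of the adjustment, not Systema's exact procedure:

```python
import numpy as np

def perturbation_specific_corr(pred, obs, shared_shift):
    """Pearson correlation after subtracting the systematic
    control-vs-perturbed shift common to all perturbations."""
    return float(np.corrcoef(pred - shared_shift, obs - shared_shift)[0, 1])

shift = np.array([1.0, 1.0, 1.0])              # systematic component
obs = shift + np.array([2.0, 0.0, -2.0])       # plus perturbation-specific signal
specific = shift + np.array([1.0, 0.0, -1.0])  # captures the specific signal
generic = shift + np.array([0.1, -0.2, 0.1])   # essentially shift-only

assert perturbation_specific_corr(specific, obs, shift) > 0.99
assert abs(perturbation_specific_corr(generic, obs, shift)) < 0.01
```

After the adjustment, a prediction that only replicates the shared shift scores near zero, while one carrying genuine perturbation-specific signal retains a high correlation.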
Table 3: Essential research reagents and computational resources for perturbation prediction studies.
| Resource | Type | Function | Example Applications |
|---|---|---|---|
| CRISPR Activation (CRISPRa) | Molecular Tool | Targeted gene overexpression using catalytically dead Cas9 fused to transcriptional activators [43] | Norman et al. K562 perturbation screen [2] |
| CRISPR Interference (CRISPRi) | Molecular Tool | Targeted gene repression using catalytically dead Cas9 fused to repressive domains [43] | Replogle et al. genome-wide screens [2] |
| Perturb-seq | Experimental Method | Combined CRISPR perturbation with single-cell RNA sequencing to measure transcriptomic effects [44] [43] | High-throughput perturbation screening [44] |
| Gene Ontology (GO) Annotations | Knowledge Base | Structured biological knowledge for gene function prediction and model generalization [2] | GEARS model extrapolation [2] |
| Protein-Protein Interaction Networks | Knowledge Graph | Prior biological network information for graph-based models [44] | GEARS and GraphReach implementations [44] |
The consistent underperformance of complex deep learning models relative to simple baselines across multiple benchmarks reveals several critical insights about the current state of perturbation prediction:
Systematic Variation Dominance: Simple baselines like the perturbed mean (average expression across all perturbed cells) perform comparably or superior to state-of-the-art methods because they effectively capture systematic differences between control and perturbed cells, which often dominate the signal in perturbation datasets [15].
Evaluation Metric Limitations: Standard evaluation metrics are highly susceptible to systematic variation, leading to overestimated performance for methods that primarily capture these consistent patterns rather than perturbation-specific effects [15].
Generalization Challenges: Models struggle to generalize beyond the systematic variation present in training data, with true zero-shot prediction of entirely novel perturbation effects remaining particularly challenging [15] [2].
For researchers applying these methods in practical settings, several methodological considerations emerge from the benchmarking results:
Baseline Implementation: Always include simple baselines (perturbed mean, additive model) as reference points when evaluating new perturbation prediction methods [15] [2].
Dataset Awareness: Understand the specific systematic variations in benchmark datasets, such as the cell cycle focus in Norman et al. or ER homeostasis focus in Adamson et al., as these strongly influence performance outcomes [15].
Evaluation Strategy: Employ comprehensive evaluation frameworks like Systema that specifically address systematic variation and focus on perturbation-specific effects rather than relying solely on standard correlation metrics [15].
Architecture Selection: Consider simpler architectures or hybrid approaches, as current evidence suggests that model complexity does not necessarily translate to improved perturbation prediction capability [2].
Recent methodological advances suggest promising directions for improving perturbation prediction:
Closed-Loop Frameworks: Incorporating experimental perturbation data during model fine-tuning has shown potential for significant performance improvements, with one study demonstrating a three-fold increase in positive predictive value compared to standard approaches [26].
Advanced Architectures: Models like PerturbNet, which utilize conditional normalizing flows to map perturbation representations to cell state distributions, show improved performance for predicting effects of completely unseen genes and can handle diverse perturbation types including small molecules and missense mutations [43].
Efficient Training Strategies: Approaches like GraphReach that optimize training perturbation selection through graph-based subset selection can substantially accelerate model development while maintaining competitive accuracy [44].
The field continues to evolve rapidly, with ongoing efforts focused on developing more robust evaluation frameworks, incorporating additional biological prior knowledge, and creating models that can genuinely generalize beyond their training data to enable accurate prediction of novel therapeutic interventions.
Single-cell foundation models (scFMs) represent a groundbreaking advancement in computational biology, applying transformer-based architectures to massive single-cell transcriptomics datasets [14]. Trained on millions of cells across diverse tissues and conditions, these models promise to learn universal biological principles that enable prediction of cellular behaviors—including responses to genetic perturbations [6] [45]. The theoretical potential is transformative: in silico simulation of genetic intervention effects could accelerate therapeutic discovery by prioritizing experiments most likely to yield valuable biological insights [26] [40].
However, comprehensive benchmarking studies reveal a significant gap between this promise and current capabilities, particularly for predicting genetic interactions in combinatorial perturbation scenarios. This guide synthesizes evidence from recent rigorous evaluations to objectively compare scFM performance against simpler alternatives, providing researchers with evidence-based recommendations for model selection in perturbation analysis.
A critical benchmark assesses model performance in predicting expression changes after dual-gene perturbations, which requires capturing non-additive genetic interactions. As shown in Table 1, multiple scFMs fail to outperform deliberately simple baselines on this complex task.
Table 1: Performance comparison on double perturbation prediction (Norman et al. dataset)
| Model Category | Specific Models | Performance vs. Additive Baseline | Key Limitations |
|---|---|---|---|
| Single-cell Foundation Models | scGPT, scFoundation, Geneformer, UCE, scBERT | Substantially higher prediction error [2] | Predictions show minimal variation across perturbations [2] |
| Task-Specific Deep Learning | GEARS, CPA | Higher prediction error than baselines [2] | Limited ability to represent genetic interactions [2] |
| Simple Baselines | Additive model (sum of individual LFCs), No-change model | Reference performance [2] | Additive model cannot predict interactions by design [2] |
When predicting genetic interactions specifically—defined as double perturbation phenotypes that significantly deviate from additive expectations—no model outperformed the "no change" baseline. All deep learning models primarily predicted buffering interactions and rarely identified synergistic interactions correctly [2].
Benchmarking extended to covariate transfer tasks, where models trained on perturbations in one cellular context must predict effects in another context. As shown in Table 2, simple linear approaches remain highly competitive.
Table 2: Performance on unseen perturbation prediction across cell lines
| Model Type | Examples | Performance vs. Linear Baselines | Data Requirements |
|---|---|---|---|
| Foundation Models with Fine-tuning | scGPT, Geneformer | Do not consistently outperform mean prediction or linear models [2] | Extensive pretraining + task-specific fine-tuning |
| Linear Models | Equation (1) with trained embeddings | Competitive or superior to foundation models [2] | Task-specific training data only |
| Mean Prediction | Simple average | Surprisingly difficult to outperform [2] | None (most basic baseline) |
Notably, incorporating pretrained gene embeddings from scFoundation or scGPT into linear models matched or exceeded the performance of the full foundation models with their native decoders. However, the most effective approach combined linear models with perturbation embeddings pretrained on orthogonal perturbation data [2].
The benchmark revealing scFMs' struggles with genetic interactions employed a rigorous, standardized methodology.
The PerturBench framework established a standardized evaluation for cross-context prediction.
Diagram Title: scFM Benchmarking Workflow for Perturbation Prediction
Multiple technical factors contribute to the performance gap in genetic interaction prediction.
Fundamental data issues underlie the modeling difficulties.
Promising approaches address current limitations through improved training methodologies.
Novel model designs specifically target perturbation prediction challenges.
Diagram Title: Pathways to Improve scFM Genetic Interaction Prediction
Table 3: Key experimental resources for perturbation prediction research
| Resource Category | Specific Examples | Function in Research | Key Characteristics |
|---|---|---|---|
| Benchmarking Datasets | Norman et al. (double perturbations), Replogle et al. (CRISPRi/a), PerturBench collection [2] [40] | Standardized evaluation across models | Diverse modalities, combinatorial perturbations, multiple cell types [40] |
| Evaluation Frameworks | PerturBench, PEREGGRN, PertEval-scFM [40] [46] [13] | Consistent model comparison and metric calculation | Modular design, multiple data splitting strategies, diverse metrics [40] [46] |
| Baseline Models | Additive model, No-change model, Linear models with embeddings [2] | Critical performance reference points | Simple implementation, established performance floor [2] |
| Foundation Models | Geneformer, scGPT, scFoundation, UCE, scBERT [6] [2] | Primary test subjects for advanced capability assessment | Large-scale pretraining, transformer architectures, zero-shot capabilities [6] [14] |
Current evidence demonstrates that single-cell foundation models have not yet fulfilled their potential for genetic interaction prediction, consistently failing to outperform simpler baseline methods. This performance gap stems from both technical limitations and fundamental biological data challenges.
For researchers pursuing perturbation effect prediction, the evidence suggests a pragmatic strategy: benchmark every method against simple baselines, prefer simpler architectures when they match performance, and reserve scFMs for settings where task-specific fine-tuning data are available.
The field continues to evolve rapidly, with new architectural innovations and training strategies regularly emerging. However, the consistent failure of current scFMs to outperform simple baselines on genetic interaction prediction underscores that model scale alone is insufficient—future progress must address fundamental limitations in capturing biological causality and combinatorial complexity.
The emergence of single-cell foundation models (scFMs) has generated significant interest in their potential to predict transcriptional responses to genetic perturbations in silico, a capability with profound implications for basic biology and therapeutic development [46] [2]. However, realizing this potential requires robust, standardized evaluation to separate genuine methodological advancement from optimistic claims. This comparative guide examines three significant benchmarking efforts—PEREGGRN, the benchmark from Nature Methods, and PertEval-scFM—that have independently addressed this critical need. These platforms employ distinct methodologies to answer a central question: can complex machine learning models, particularly deep-learning-based scFMs, reliably outperform simple baselines in predicting perturbation effects? This analysis synthesizes their experimental protocols, findings, and resources to provide researchers with a clear understanding of the current benchmarking landscape and its consensus conclusions.
This section details the core design and experimental protocols of the three major benchmarking platforms, highlighting their unique focus areas and methodological approaches.
The PEREGGRN (PErturbation Response Evaluation via a Grammar of Gene Regulatory Networks) framework is built around a central software engine called GGRN (Grammar of Gene Regulatory Networks). Its design is inherently modular, focusing on supervised machine learning to forecast each gene's expression based on candidate regulators [46].
This benchmark study took a direct approach to evaluating several prominent deep-learning-based models, including the foundation models scGPT and scFoundation, against deliberately simple baselines [2].
PertEval-scFM is a standardized framework specifically designed for the zero-shot evaluation of single-cell foundation model embeddings for perturbation effect prediction [3] [47] [25].
The following diagram illustrates the core methodological workflow shared by these benchmarking platforms, from data input to performance evaluation.
A synthesis of results across all three benchmarking platforms reveals a consistent and striking conclusion: current complex methods, including deep-learning-based foundation models, generally fail to outperform simple baseline approaches.
The table below summarizes the quantitative findings and conclusions across the three benchmarking platforms.
Table 1: Consolidated Performance Findings Across Benchmarking Platforms
| Benchmark Platform | Top-Performing Methods | Key Comparative Finding | Performance Context |
|---|---|---|---|
| PEREGGRN [46] | Simple baselines | "Uncommon" for complex methods to outperform simple baselines | Evaluation across 11 human perturbation datasets |
| Nature Methods [2] | Additive model, Linear model, "No change" model | "None outperformed the baselines" | Double & unseen single perturbation prediction |
| PertEval-scFM [25] | Simple baseline models | "No consistent improvements" from scFM embeddings | Zero-shot prediction under distribution shift |
A critical aspect of these benchmarks is their rigorous experimental design, which includes carefully curated data resources and specific evaluation protocols to ensure fair and biologically meaningful comparisons.
Each platform employs a suite of metrics to comprehensively evaluate performance, recognizing that no single metric perfectly captures prediction utility.
Table 2: Essential Research Reagents and Resources
| Resource Name | Type | Function in Benchmarking | Example Sources/Platforms |
|---|---|---|---|
| Perturbation Datasets | Data | Provide ground-truth transcriptome changes for training & evaluation | Norman et al., Replogle et al., Adamson et al. [2] |
| Gene Networks | Prior Knowledge | Inform regulatory relationships for model training | Motif-based, co-expression networks [46] |
| Benchmarking Software | Tool | Standardize evaluation protocols & metrics | PEREGGRN, PertEval-scFM [46] [25] |
| Unified Model APIs | Tool | Enable consistent model integration & switching | BioLLM framework [8] |
| Simple Baseline Models | Method | Provide critical performance reference point | "No change", "Additive", Linear models [2] |
The consensus across multiple independent, rigorous benchmarks is clear and consistent: despite their theoretical promise and architectural complexity, current deep-learning-based foundation models and specialized expression forecasting methods have not demonstrated superior performance over simple baseline models for predicting genetic perturbation effects. This conclusion holds across various prediction tasks—single and double perturbations, seen and unseen perturbations—and is robust to the choice of evaluation metric [46] [2] [25].
These findings highlight the critical importance of standardized, neutral benchmarking in directing and evaluating methodological development in computational biology. The emergence of platforms like PEREGGRN, PertEval-scFM, and the methodologies in the Nature Methods study provides the community with the tools necessary for rigorous self-assessment. Future progress in the field will depend on acknowledging these results and focusing on developing models that can genuinely capture the biological complexity of gene regulatory systems, rather than merely increasing model parameter counts. The available evidence suggests that pretraining on large-scale perturbation data may be more beneficial than pretraining on single-cell atlas data alone, pointing to a potential pathway for future improvement [2].
Predicting how cells respond to genetic perturbations represents a significant unsolved challenge in functional genomics with profound implications for therapeutic development. Single-cell foundation models (scFMs) pre-trained on vast single-cell atlases enable in silico perturbation (ISP) predictions, simulating cellular state changes without exhaustive experimental validation. However, the true predictive power of these models remains poorly characterized, particularly their ability to generalize beyond systematic variations caused by selection biases or biological confounders. This guide objectively compares the performance of established perturbation response prediction methods through two distinct biological case studies: T-cell activation and RUNX1-Familial Platelet Disorder (RUNX1-FPD). We provide experimental protocols, quantitative performance data, and analytical frameworks to help researchers select and implement appropriate evaluation strategies for perturbation modeling in their own work.
We evaluated established perturbation response prediction methods on their ability to predict transcriptional outcomes of unseen genetic perturbations. The benchmark included three state-of-the-art methods—compositional perturbation autoencoder (CPA), GEARS, and scGPT—alongside two nonparametric baselines capturing average perturbation effects (Perturbed Mean and Matching Mean). Evaluation spanned ten single-cell perturbation datasets from six sources, covering three distinct technologies and five different cell lines, including genome-wide and combinatorial two-gene perturbation screens [15].
Performance was assessed using evaluation metrics previously established in the literature.
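Exact metric definitions vary across benchmarks, but one widely used readout is the Pearson correlation between predicted and observed mean expression changes relative to control (the "Pearson delta"). A minimal sketch of that computation, with all expression values hypothetical:

```python
import numpy as np

def pearson_delta(pred_expr, obs_expr, ctrl_expr):
    """Correlate predicted and observed shifts from the control mean."""
    pred_delta = pred_expr - ctrl_expr
    obs_delta = obs_expr - ctrl_expr
    return float(np.corrcoef(pred_delta, obs_delta)[0, 1])

# Toy mean-expression vectors over 5 genes (illustrative numbers only).
ctrl = np.array([1.0, 2.0, 0.5, 3.0, 1.5])
obs  = np.array([1.4, 1.6, 0.5, 3.8, 1.5])  # observed perturbed means
pred = np.array([1.3, 1.7, 0.6, 3.6, 1.4])  # model prediction

r = pearson_delta(pred, obs, ctrl)
```

Correlating deltas rather than absolute expression avoids rewarding a model simply for reproducing the (highly correlated) baseline transcriptome.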
The Systema framework was developed to address limitations in standard evaluation approaches. It (1) mitigates systematic biases by focusing on perturbation-specific effects and (2) provides interpretable readouts of a method's ability to reconstruct the perturbation landscape, differentiating predictions that merely replicate systematic effects from those capturing biologically informative responses [15].
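Systema's own interface is not reproduced here; the sketch below only illustrates the underlying idea: a baseline that merely reproduces the shift shared by all perturbed cells scores well under raw per-perturbation correlation, but near zero once that shared shift is subtracted. All array names and sizes are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

n_perts, n_genes = 50, 200
systematic = rng.normal(0, 1, n_genes)           # shift shared by all perturbations
specific = rng.normal(0, 1, (n_perts, n_genes))  # perturbation-specific effects
observed = systematic + specific                 # observed deltas vs control

# A "baseline" that only reproduces the systematic shift for every perturbation.
baseline_pred = np.tile(systematic, (n_perts, 1))

def mean_corr(pred, obs):
    return float(np.mean([np.corrcoef(p, o)[0, 1] for p, o in zip(pred, obs)]))

# Raw evaluation: the uninformative baseline looks deceptively good.
raw_score = mean_corr(baseline_pred, observed)

# Controlled evaluation: subtract the across-perturbation mean delta first,
# so only perturbation-specific signal contributes to the score.
shared = observed.mean(axis=0)
ctrl_score = mean_corr(baseline_pred - shared, observed - shared)
```

The gap between `raw_score` and `ctrl_score` is exactly the inflation that systematic variation introduces into standard metrics.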
A critical advancement in scFM training involves "closing the loop" by incorporating experimental perturbation data during model fine-tuning. This approach extends scFMs beyond initial pre-training by iteratively refining predictions using observed perturbation outcomes, creating a feedback cycle that significantly enhances biological accuracy [26].
The following diagram illustrates this integrated computational and experimental workflow:
Figure 1: Closed-Loop scFM Framework - Integrating experimental data to improve prediction accuracy
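The actual fine-tuning machinery of [26] is not reproduced here; the toy loop below only captures the closed-loop pattern (measure a few perturbations, refit, re-predict held-out perturbations) using a linear stand-in model. `run_experiment`, the gene embeddings, and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: each perturbation's transcriptomic effect is a hidden linear
# function of a (hypothetical) per-gene embedding.
n_genes, emb_dim = 100, 8
emb = rng.normal(0, 1, (n_genes, emb_dim))     # one embedding per perturbable gene
true_w = rng.normal(0, 1, (emb_dim, n_genes))  # hidden effect map

def run_experiment(g):
    """Stand-in for a wet-lab Perturb-seq measurement of perturbing gene g."""
    return emb[g] @ true_w + rng.normal(0, 0.1, n_genes)

# Closed loop: measure 5 new perturbations per round, refit, re-predict.
held_out = list(range(80, 100))                # never measured, only predicted
train_X, train_y, scores = [], [], []
for _ in range(4):
    for g in rng.choice(80, 5, replace=False):
        train_X.append(emb[g])
        train_y.append(run_experiment(g))
    W = np.linalg.lstsq(np.array(train_X), np.array(train_y), rcond=None)[0]
    preds = emb[held_out] @ W                  # predictions for unseen perturbations
    truth = emb[held_out] @ true_w
    scores.append(float(np.corrcoef(preds.ravel(), truth.ravel())[0, 1]))
```

Even in this toy, accuracy on held-out perturbations climbs quickly over the first few rounds of incorporated measurements, mirroring the saturation behavior reported for closed-loop ISP.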
Biological Context: T-cell activation through CD3-CD28 stimulation or PMA/ionomycin treatment represents a well-characterized biological system with applications in cancer immunotherapy, autoimmunity, and infectious disease. This case study provides a robust benchmark for evaluating perturbation prediction accuracy [26].
Computational Methods:
ISP Implementation: The fine-tuned model performed ISP across 13,161 genes, simulating both gene overexpression (CRISPRa) and knockout (CRISPRi) to model transcriptional outcomes.
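Different scFMs implement ISP differently (Geneformer, for instance, edits a gene's position in its rank-value encoding), so the following is only a generic input-editing sketch; `in_silico_perturb`, the identity stand-in for the model, and the mode names are our own.

```python
import numpy as np

def in_silico_perturb(expr, gene_idx, mode, predict_fn):
    """Edit one gene's input value, then let the model map the edited
    profile to a predicted post-perturbation transcriptome."""
    x = expr.copy()
    if mode == "knockout":         # CRISPRi-like: silence the gene
        x[gene_idx] = 0.0
    elif mode == "overexpress":    # CRISPRa-like: push to the profile maximum
        x[gene_idx] = expr.max()
    return predict_fn(x)

# Stand-in "model": identity mapping (a real scFM forward pass goes here).
expr = np.array([2.0, 0.5, 1.0, 3.0])
ko = in_silico_perturb(expr, 0, "knockout", lambda x: x)
oe = in_silico_perturb(expr, 1, "overexpress", lambda x: x)
```

Copying the input before editing matters in practice: the same baseline profile is typically reused across thousands of simulated perturbations.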
Table 1: T-Cell Activation Prediction Performance Metrics
| Method | Positive Predictive Value | Negative Predictive Value | Sensitivity | Specificity | AUROC |
|---|---|---|---|---|---|
| Open-Loop ISP | 3% | 98% | 48% | 60% | 0.63 |
| Differential Expression | 3% | 78% | 40% | 50% | N/A |
| Closed-Loop ISP | 9% | 99% | 76% | 81% | 0.86 |
| ISP + DE Overlap | 7% | N/A | N/A | N/A | N/A |
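The table's first four metrics all derive from a standard confusion matrix, and the pattern of low PPV alongside high NPV is what low hit prevalence produces even at decent sensitivity and specificity. A sketch with hypothetical counts chosen to roughly mimic the closed-loop row:

```python
def confusion_metrics(tp, fp, tn, fn):
    return {
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical counts for a low-prevalence screen (100 true hits in 4,300 genes).
m = confusion_metrics(tp=76, fp=800, tn=3200, fn=24)
```

With only ~2% of genes being true hits, even 80% specificity admits enough false positives to hold PPV below 10%, which is why cross-method consensus filtering (below) is valuable.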
The benchmarking revealed that open-loop ISP and differential expression analysis identified largely non-overlapping gene sets, with only 2.9% of predictions overlapping between methods. Notably, only 21 genes were predicted by both methods to have effects in the same direction: 11 shifting toward activation and 10 toward resting state. These overlapping genes represented key T-cell activation regulators including IL2RA, VAV1, ZAP70, CD3D, CD3G, and LCP2 [26].
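The consensus described above (intersecting ISP and DE hits that agree in predicted direction) reduces to a simple set operation. The gene symbols below are drawn from the study's overlap set, but the signed directions are illustrative, not the measured ones:

```python
# Signed predictions: +1 = shifts cells toward activation, -1 = toward resting.
isp = {"IL2RA": +1, "VAV1": +1, "ZAP70": +1, "GENE_A": -1, "GENE_B": +1}
de  = {"IL2RA": +1, "VAV1": +1, "ZAP70": -1, "GENE_A": -1, "GENE_C": +1}

# Keep only genes called by both methods with the same direction of effect.
consensus = {g for g in isp.keys() & de.keys() if isp[g] == de[g]}
```

Requiring directional agreement, not just shared membership, is what excludes genes the two methods call with opposite effects.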
A key finding was that closed-loop performance improved dramatically with just 10 perturbation examples (sensitivity: 61%, specificity: 66%) and approached saturation at approximately 20 examples (sensitivity: 76%, specificity: 79%). Performance did not improve significantly beyond this point, indicating that even a modest amount of experimental validation can substantially enhance accuracy over open-loop ISP [26].
Clinical Background: RUNX1-FPD is a rare autosomal dominant disorder caused by germline mutations in the RUNX1 gene, characterized by thrombocytopenia, platelet dysfunction, and approximately 44% lifetime risk of hematological malignancies, primarily myelodysplastic syndrome and acute myeloid leukemia [48] [49]. With an estimated 18,000-20,000 affected individuals in the United States, this condition represents a significant unmet medical need as no interventions currently exist to prevent leukemic progression [50] [26].
Experimental Models: primary patient HSPCs, RUNX1-engineered HSCs, and patient-derived iPSCs (Table 3).
Pathophysiological Insights: Single-cell RNA sequencing of FPD bone marrow cells (122,021 FPD vs. 48,781 healthy cells) revealed altered hematopoietic differentiation with increased monocyte and T-cell populations, decreased megakaryocyte-erythroid progenitors, and upregulated inflammatory pathways including TNF-α/NF-κB, IFN-γ response, and TGF-β signaling [50].
Mechanistic investigation identified CD74 as a master regulator elevated in preleukemic RUNX1-FPD, driving inflammation through mTOR and JAK/STAT pathway activation. CD74-mediated signaling was exaggerated in RUNX1-FPD hematopoietic stem and progenitor cells compared to healthy controls, leading to increased cytokine production [50].
The following diagram illustrates the key signaling pathways and therapeutic intervention points:
Figure 2: RUNX1-FPD Signaling & Therapeutic Targeting - Key pathways and intervention strategies
Computational Target Discovery: Application of the closed-loop framework to RUNX1-FPD identified eight genes with available small-molecule inhibitors that could shift RUNX1-knockout HSCs toward a control-like state. From these, four key therapeutic pathways emerged: CD74 signaling, the mTOR pathway, the JAK/STAT pathway, and RUNX1 stabilization (Table 2).
Experimental Validation: Genetic and pharmacological targeting of CD74 with ISO-1, and its downstream targets JAK1/2 and mTOR with ruxolitinib and sirolimus respectively, reversed RUNX1-FPD differentiation defects in vitro and in vivo and reduced inflammation. These interventions suppressed the exaggerated CD74 signaling, normalized mTOR and JAK/STAT pathway activation, and reduced cytokine production [50].
Table 2: RUNX1-FPD Therapeutic Targets & Experimental Outcomes
| Therapeutic Target | Experimental Agent | Experimental Model | Key Outcomes |
|---|---|---|---|
| CD74 Signaling | ISO-1 | Primary patient BM cells, in vivo models | Reduced inflammation, reversed differentiation defects |
| mTOR Pathway | Sirolimus | Primary patient BM cells, in vivo models | Restored megakaryocytic differentiation, reduced cytokine production |
| JAK/STAT Pathway | Ruxolitinib | Primary patient BM cells, in vivo models | Suppressed inflammatory signaling, improved hematopoietic function |
| RUNX1 Stabilization | Proteasomal inhibition | Patient-derived iPSCs, AML blood cells | Enhanced RUNX1 levels, improved megakaryocytic differentiation |
Table 3: Essential Research Materials for Perturbation & FPD Studies
| Reagent/Category | Specific Examples | Research Application | Functional Role |
|---|---|---|---|
| scFM Platforms | Geneformer-30M-12L, scGPT, GEARS, CPA | In silico perturbation prediction | Base models for predicting transcriptional responses to genetic perturbations |
| Genetic Perturbation Tools | CRISPRi/a, Perturb-seq | T-cell activation screens | High-throughput functional genomic screening to validate predictions |
| RUNX1-FPD Models | Primary patient HSPCs, RUNX1-engineered HSCs, Patient-derived iPSCs | Disease modeling, drug screening | Physiologically relevant systems for studying disease mechanisms and therapies |
| Therapeutic Compounds | ISO-1, Ruxolitinib, Sirolimus | Target validation, functional rescue | Pharmacological probes for pathway inhibition and therapeutic assessment |
| Analytical Tools | Systema framework, AUCell, GSEA | Method evaluation, pathway analysis | Benchmarking prediction accuracy and identifying biologically meaningful effects |
The case studies reveal that standard evaluation metrics can be misleading due to systematic variation—consistent transcriptional differences between perturbed and control cells arising from selection biases or biological confounders. In the Adamson (endoplasmic reticulum homeostasis) and Norman (cell cycle and growth) datasets, systematic differences in pathway activities between perturbed and control cells significantly influenced predictive performance [15].
The closed-loop framework demonstrated substantial improvement across both case studies, with the most significant gains in positive predictive value. This approach effectively addresses the limitation of open-loop predictions merely capturing average perturbation effects rather than perturbation-specific biology.
A critical distinction emerges between technical generalization (performance on unseen perturbations within similar biological contexts) and biological generalization (performance across different cell types and disease states). While methods showed reasonable technical generalization in T-cell activation, biological generalization across the hematopoiesis-to-inflammation spectrum of RUNX1-FPD presented greater challenges, highlighting the need for domain-specific fine-tuning and biological context incorporation.
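The distinction between the two generalization regimes is ultimately a question of how evaluation splits are constructed. A minimal sketch under our own naming (records, contexts, and gene labels are illustrative):

```python
# Records of available screens: (perturbed_gene, cell_context).
records = [
    ("IL2RA", "t_cell"), ("VAV1", "t_cell"), ("ZAP70", "t_cell"),
    ("RUNX1", "hsc"), ("CD74", "hsc"),
]

# Technical generalization: hold out perturbations within an already-seen
# biological context.
tech_test = [r for r in records if r[1] == "t_cell" and r[0] == "ZAP70"]
tech_train = [r for r in records if r not in tech_test]

# Biological generalization: hold out an entire cell context, forcing the
# model to transfer across biology it never trained on.
bio_test = [r for r in records if r[1] == "hsc"]
bio_train = [r for r in records if r[1] != "hsc"]
```

Reporting both split types separately prevents strong within-context performance from masking weak cross-context transfer.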
For RUNX1-FPD, the identification of the CD74 signaling axis and successful pharmacological targeting with repurposed JAK1/2 and mTOR inhibitors provides a promising near-term therapeutic strategy. The computational prediction of protein kinase C and phosphoinositide 3-kinase as additional targets offers expanded opportunities for intervention [50] [26].
Based on our comparative analysis, we recommend:
1. Adopt the Systema Framework: Implement the Systema evaluation framework or similar approaches that control for systematic variation when benchmarking perturbation prediction methods.
2. Prioritize Closed-Loop Implementation: Incorporate experimental perturbation data during model fine-tuning, as even 10-20 validated examples can significantly enhance prediction accuracy.
3. Contextualize Performance Metrics: Interpret predictive performance in the context of systematic variation specific to each biological system and experimental design.
4. Leverage Cross-Method Consensus: Consider genes identified by both ISP and differential expression analysis as high-confidence targets, as they demonstrate substantially higher positive predictive value.
5. Validate in Disease-Relevant Models: Employ physiological systems such as primary patient cells and genetically engineered HSCs for target validation, particularly for rare diseases where samples are scarce.
The integration of sophisticated computational prediction with rigorous experimental validation through closed-loop frameworks represents the most promising path toward realizing the potential of "virtual cell" models for biomedical discovery and therapeutic development.
The current state of perturbation effect prediction is one of recalibration. While scFMs represent a significant technological ambition, consistent benchmarking reveals they have not yet surpassed the predictive power of deliberately simple models for core tasks. The critical takeaways are threefold: first, systematic biological and technical variation in datasets poses a major challenge that inflates standard performance metrics; second, new evaluation frameworks like Systema are essential to distinguish true biological insight from data artifacts; and third, the emerging 'closed-loop' approach, which iteratively integrates experimental perturbation data into model fine-tuning, demonstrates a tangible path to substantially improved accuracy. The future of the field hinges on developing more robust models that can genuinely generalize to novel biology, coupled with transparent and rigorous benchmarking. For biomedical research, this evolving capability holds the long-term promise of accelerating therapeutic discovery, particularly for rare diseases where experimental screening is most challenging.