The scFM Reality Check: Evaluating Perturbation Effect Prediction for Biomedical Discovery

Jacob Howard · Nov 27, 2025


Abstract

Single-cell foundation models (scFMs) promise to predict transcriptomic responses to genetic perturbations, offering a powerful in silico tool for drug target discovery and functional genomics. This article synthesizes recent rigorous benchmarking studies that reveal a critical gap: current scFMs often fail to outperform simple linear baselines, struggling to generalize beyond systematic biases in training data. We explore the methodological underpinnings of leading models such as scGPT, Geneformer, and GEARS; the emerging 'closed-loop' fine-tuning paradigm, which incorporates experimental data to enhance accuracy; and Systema, a new evaluation framework designed for biologically meaningful assessment. For researchers and drug development professionals, this review provides a crucial guide to the current capabilities, limitations, and optimal application of perturbation prediction models, highlighting a pivotal moment of recalibration and future potential in the field.

The Promise and Challenge of In Silico Perturbation Screening

The quest to create a faithful in silico model of a cell—a "virtual cell"—has long been a goal of computational biology. Such a model promises to revolutionize drug discovery by enabling researchers to simulate and predict the effects of genetic and chemical perturbations safely and economically, thereby accelerating target identification [1]. A core test for these models is the accurate prediction of transcriptional responses to genetic perturbations, a task that single-cell foundation models (scFMs), inspired by large language models, were expected to excel at [2].

However, recent rigorous benchmarking studies have revealed a significant performance gap. This guide provides an objective comparison of the current state of perturbation effect prediction, focusing on the empirical evaluation of scFMs against simpler baseline models. The findings underscore a critical moment in the field: the need for more reliable evaluation metrics and specialized models to realize the full potential of virtual cells for target discovery [3] [2] [4].

Performance Benchmarking: scFMs vs. Simple Baselines

Key Findings from Recent Benchmarking Studies

Independent benchmarks have consistently shown that current deep-learning-based scFMs do not outperform deliberately simple baseline models in predicting perturbation effects [3] [2].

Table 1: Summary of Model Performance on Perturbation Prediction Tasks

| Model Category | Representative Models | Performance on Double Perturbation Prediction | Performance on Unseen Perturbation Prediction | Key Limitations |
|---|---|---|---|---|
| Single-cell foundation models (scFMs) | scGPT, scFoundation, scBERT, Geneformer, UCE [2] | Underperformed the additive baseline; higher prediction error (L2 distance) [2] | Did not consistently outperform the "mean prediction" or simple linear baseline [2] | Struggle with strong/atypical effects and distribution shift; high computational cost [3] [2] |
| Other deep learning models | GEARS, CPA [2] | Underperformed the additive baseline [2] | GEARS did not consistently outperform baselines [2] | Predictions vary less than ground truth [2] |
| Simple baseline models | "No change", "Additive", linear model [2] | "Additive" baseline had the lowest prediction error [2] | Simple linear model and "mean prediction" were highly competitive or superior [2] | Incapable of representing complex biological interactions [2] |

A study published in Nature Methods (2025) directly compared five foundation models and two other deep learning models against simple baselines for predicting transcriptome changes after single or double gene perturbations. The study concluded that "none outperformed the baselines" [2]. Similarly, the PertEval-scFM benchmarking framework found that zero-shot scFM embeddings "do not provide consistent improvements over baseline models, especially under distribution shift" and that all benchmarked models struggled with predicting strong or atypical perturbation effects [3].

Performance on Specific Prediction Tasks

Double Perturbation Prediction: In a benchmark using data from Norman et al. where 124 pairs of genes were perturbed, all deep learning models had a substantially higher prediction error (L2 distance) than the "additive" baseline, which simply sums the individual logarithmic fold changes of single perturbations [2].

Genetic Interaction Prediction: When tasked with predicting synergistic or buffering genetic interactions, none of the deep learning models performed better than the "no change" baseline, which always predicts the control condition [2].

Unseen Perturbation Prediction: For predicting the effects of entirely new perturbations, a simple linear model (or even just predicting the mean of the training data) was not consistently outperformed by any of the deep learning models, including those designed for this task [2].
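The two reference baselines are simple enough to state in a few lines. A minimal NumPy sketch (array names are illustrative, and profiles are assumed to be log-scale expression vectors; this is not the benchmark's actual code):

```python
import numpy as np

def no_change_baseline(control: np.ndarray) -> np.ndarray:
    """Always predict the control expression profile."""
    return control.copy()

def additive_baseline(control: np.ndarray, single_a: np.ndarray,
                      single_b: np.ndarray) -> np.ndarray:
    """Sum the two single-perturbation log fold changes on top of control."""
    return control + (single_a - control) + (single_b - control)

# Toy log-expression profiles for a 4-gene readout.
control = np.array([1.0, 2.0, 0.5, 3.0])
pert_a = np.array([1.5, 2.0, 0.5, 2.0])   # gene 0 up, gene 3 down
pert_b = np.array([1.0, 3.0, 0.5, 3.0])   # gene 1 up
double_pred = additive_baseline(control, pert_a, pert_b)
```

Despite using no information about gene interactions, this additive prediction is the bar that the benchmarked deep learning models failed to clear.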

Experimental Protocols and Evaluation Metrics

Benchmarking Workflow and Model Evaluation

The following diagram illustrates the standard workflow for benchmarking perturbation prediction models, as used in recent critical studies [2].

[Workflow diagram] Experimental phase: Input perturbation datasets → Data partitioning → Model training & fine-tuning → In silico prediction. Evaluation phase: Model evaluation → Performance comparison → Output: benchmark results.

Detailed Benchmarking Methodology

The protocols below are synthesized from the methodologies of PertEval-scFM and the Nature Methods benchmark [3] [2].

1. Data Sourcing and Preprocessing

  • Datasets: Benchmarks typically use publicly available Perturb-seq datasets, such as those from Norman et al. (double gene perturbations in K562 cells), Replogle et al. (CRISPRi in K562 and RPE1 cells), and Adamson et al. (single gene perturbations in K562 cells) [2].
  • Preprocessing: Gene expression values are log-transformed. Data is partitioned into training and held-out test sets, often with multiple random splits (e.g., 5 times) for robustness [2].
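The preprocessing convention above can be sketched as follows. This is a toy illustration with synthetic counts; `np.log1p` and the 80/20 split ratio are assumptions, not specifics taken from the benchmarks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: rows = cells/conditions, columns = genes.
counts = rng.poisson(5.0, size=(200, 1000)).astype(float)

# Log-transform (log1p is one common convention; the exact transform
# used by each benchmark may differ).
logged = np.log1p(counts)

# Several random train/test partitions for robustness (here 5 splits, 80/20).
n = logged.shape[0]
splits = []
for seed in range(5):
    perm = np.random.default_rng(seed).permutation(n)
    cut = int(0.8 * n)
    splits.append((perm[:cut], perm[cut:]))
```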

2. Model Training and Fine-tuning

  • Baseline Models: Simple models like the "no change" (predicts control expression) and "additive" (sums single perturbation LFCs) are established without using the complex perturbation data for training [2].
  • Deep Learning Models: scFMs and other deep learning models are fine-tuned on the training set, which includes single perturbations and a portion of the double perturbations [2].

3. Prediction and Evaluation

  • Core Prediction Task: Models predict gene expression values (e.g., for the 1,000 most highly expressed genes) after one or more perturbations [2].
  • Primary Metrics:
    • L2 Distance: The Euclidean distance between predicted and observed expression values. A lower value indicates better performance [2].
    • Genetic Interaction Detection: Models are evaluated on their ability to identify non-additive interactions (synergistic, buffering) by comparing true-positive rates and false discovery proportions against a ground truth derived from statistical significance testing [2].
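As an illustration, the L2 metric restricted to the most highly expressed genes might be computed as below (a sketch; the convention of ranking genes by mean observed expression is an assumption):

```python
import numpy as np

def l2_error(pred: np.ndarray, obs: np.ndarray, top_k: int = 1000) -> np.ndarray:
    """Per-condition Euclidean distance between predicted and observed
    expression, restricted to the top_k most highly expressed genes
    (ranked by mean observed expression). Lower is better."""
    top = np.argsort(obs.mean(axis=0))[::-1][:top_k]
    return np.linalg.norm(pred[:, top] - obs[:, top], axis=1)

rng = np.random.default_rng(0)
obs = rng.random((5, 2000))        # 5 conditions x 2000 genes
perfect = l2_error(obs, obs)       # identical prediction gives zero error
```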

The Scientist's Toolkit: Essential Research Reagents and Data

Building and evaluating virtual cell models requires specific types of data and computational resources. The following table lists key "research reagents" for this field.

Table 2: Essential Research Reagents and Data for Virtual Cell Development

| Item Name | Type | Function in Virtual Cell Research | Example Sources/Formats |
|---|---|---|---|
| A Priori Knowledge | Data Pillar | Encapsulates fundamental biological mechanisms from existing literature; foundation for model construction [5] | Text-based literature, molecular databases (e.g., Gene Ontology [2]) |
| Static Architecture Data | Data Pillar | Provides a snapshot of cellular structures; essential for defining the model's spatial and morphological context [5] | Cryo-EM, super-resolution imaging, spatial omics data |
| Dynamic States Data | Data Pillar | Captures cellular changes over time or after perturbation; critical for training predictive models [5] | Perturb-seq, perturbation proteomics, time-series omics data [5] |
| Benchmarking Datasets | Data Resource | Standardized datasets used to evaluate and compare model performance objectively [3] [2] | Norman et al., Replogle et al., Adamson et al. datasets [2] |
| Linear Baseline Models | Computational Tool | Simple, interpretable models that serve as a critical baseline for evaluating complex scFMs [2] | "No change", "Additive", linear regression models [2] |

A Confounding Factor: The Role of Metric Calibration

While the benchmarks above suggest limited performance for scFMs, it is critical to consider the tools used for evaluation. Recent research from Shift Bioscience indicates that concerns about model reliability may be partly due to metric miscalibration [4].

Their study argues that common evaluation metrics often struggle to distinguish robust predictions from uninformative ones, particularly for weaker genetic perturbations. When using a newly calibrated framework involving rank-based and Differentially Expressed Gene (DEG)-aware metrics, virtual cell models demonstrated clear and consistent improvements over traditional uninformative baselines [4]. This highlights that the choice and calibration of evaluation metrics are as important as the model architecture itself.

The Path Forward: Principles for Future Development

The current state of perturbation prediction reveals several requirements for future progress. The diagram below outlines a proposed closed-loop framework for developing more robust virtual cells [5].

[Closed-loop diagram] 1. AI model makes a prediction → 2. Identify knowledge gaps → 3. Design perturbation experiments → 4. Robotic platform executes experiments → 5. Generate & integrate new dynamic data → 6. Refine & validate the AI model → back to step 1.

This framework emphasizes continuous learning and is built upon three essential data pillars [5]:

  • A Priori Knowledge: Existing fragmented biological knowledge.
  • Static Architecture: High-resolution, spatial snapshots of cellular structures.
  • Dynamic States: Data from systematic perturbations, which is currently the most limited and critical pillar.

Future efforts must prioritize generating high-quality, diverse perturbation data and developing biologically-grounded, well-calibrated benchmarks to guide the development of virtual cells that can truly accelerate target discovery [3] [1] [5].

Single-cell RNA sequencing (scRNA-seq) has revolutionized biological research by enabling high-resolution profiling of gene expression at the individual cell level, revealing cellular heterogeneity, developmental trajectories, and disease mechanisms that were previously obscured in bulk measurements [6] [7]. The exponential growth of single-cell transcriptomics data has catalyzed the development of single-cell foundation models (scFMs)—large-scale machine learning models pre-trained on millions of cells—with the promise of learning universal biological principles and accelerating discovery across diverse applications [6] [8].

These models, including prominent examples like scGPT, Geneformer, and scFoundation, adapt transformer architectures and other advanced neural network designs to analyze scRNA-seq data. They aim to capture complex gene-gene relationships and cellular states during pre-training, which can then be leveraged for downstream tasks with minimal additional task-specific training (fine-tuning) or even used directly (zero-shot) [6] [9]. Particularly compelling is their potential application in perturbation effect prediction—using computational models to forecast how cells will respond to genetic or chemical perturbations, which is crucial for understanding disease mechanisms and identifying therapeutic targets [2] [10].

However, as these models proliferate, rigorous benchmarking studies have raised critical questions about their actual performance relative to established, simpler methods, especially for predicting perturbation responses [2] [9] [10]. This guide provides an objective comparison of three leading scFMs—scGPT, Geneformer, and scFoundation—synthesizing evidence from recent comprehensive evaluations to help researchers navigate this rapidly evolving landscape.

Model Architectures and Pre-training Approaches

Single-cell foundation models employ distinct architectural designs and pre-training strategies to learn from scRNA-seq data, which presents unique challenges of high dimensionality, sparsity, and technical noise [6] [7].

Architectural Foundations and Input Representations

The following table compares the core architectural characteristics and pre-training configurations of scGPT, Geneformer, and scFoundation.

Table 1: Architectural and Pre-training Comparison of scFMs

| Feature | scGPT | Geneformer | scFoundation |
|---|---|---|---|
| Model architecture | Transformer encoder | Transformer encoder | Asymmetric encoder-decoder |
| Parameters | ~50 million | ~40 million | ~100 million |
| Pre-training dataset size | ~33 million cells | ~30 million cells | ~50 million cells |
| Input gene count | 1,200 HVGs | 2,048 ranked genes | ~19,000 genes |
| Value representation | Value binning | Ranking | Value projection |
| Gene embedding | Lookup table | Lookup table | Lookup table |
| Positional embedding | No | Yes | No |
| Primary pre-training task | Masked gene modeling (MSE loss) | Masked gene modeling (CE loss) | Read-depth-aware MGM (MSE loss) |

A key differentiator among scFMs is how they handle input representation. scRNA-seq data pairs gene identities with expression values, requiring specialized tokenization approaches [6] [7]:

  • Rank-based discretization (Geneformer): Genes are ranked by expression within each cell, emphasizing relative expression patterns and reducing batch effects.
  • Bin-based discretization (scGPT): Expression values are grouped into predefined bins, balancing resolution preservation with sequence modeling efficiency.
  • Value projection (scFoundation): Continuous expression values are projected into embeddings, maintaining full data resolution without discretization [7].
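The three tokenization strategies can be contrasted on a toy expression vector. This is a simplified sketch: real models use learned or corpus-derived bin edges and trained projection weights, and `W` below is a stand-in:

```python
import numpy as np

x = np.array([0.0, 5.2, 1.3, 0.7, 3.1])  # expression for 5 genes in one cell

# Rank-based (Geneformer-style): gene indices ordered by descending expression.
rank_tokens = np.argsort(x)[::-1]

# Bin-based (scGPT-style): discretize values into a few equal-width bins
# spanning the nonzero range.
n_bins = 3
nonzero = x[x > 0]
edges = np.linspace(nonzero.min(), nonzero.max(), n_bins + 1)
bin_tokens = np.digitize(x, edges[1:-1])  # zeros fall into the lowest bin

# Value projection (scFoundation-style): continuous values mapped through a
# (here, dummy) linear projection into embedding space.
W = np.full((1, 8), 0.1)                  # stand-in for learned weights
value_embeddings = x[:, None] @ W         # shape (5, 8), no discretization
```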

Model Architecture Workflow

The diagram below illustrates the typical workflow for processing single-cell data in transformer-based scFMs, from input representation to output embedding.

[Workflow diagram] Raw gene expression vector → normalization & preprocessing → tokenization (rank-based for Geneformer; bin-based for scGPT; value projection for scFoundation) → gene-embedding lookup table → value embedding → positional embedding (Geneformer only) → transformer layers (self-attention) → cell-embedding and gene-embedding outputs.

Benchmarking Methodologies for Perturbation Prediction

Evaluating scFMs for perturbation effect prediction requires standardized benchmarks that assess their ability to predict transcriptomic changes after genetic perturbations. Key benchmarking frameworks have emerged, employing consistent datasets and metrics for fair comparison [2] [10].

Experimental Protocols and Datasets

Benchmarking studies typically employ a unified experimental protocol to evaluate model performance:

  • Training Configuration: Models are fine-tuned on datasets containing single genetic perturbations, then evaluated on their ability to predict effects of unseen single or double perturbations.

  • Data Sources: Common benchmark datasets include:

    • Norman et al. data: CRISPR activation (CRISPRa) in K562 cells, measuring 100 single-gene and 124 double-gene perturbations.
    • Adamson et al. data: CRISPR interference (CRISPRi) in K562 cells with single-gene perturbations.
    • Replogle et al. data: Genome-wide CRISPRi screens in K562 and RPE1 cell lines [2] [10].
  • Evaluation Metrics:

    • L2 Distance: Measures direct difference between predicted and observed expression values.
    • Pearson Delta: Correlation between predicted and observed differential expression (perturbed vs. control).
    • Genetic Interaction Prediction: Assesses ability to identify non-additive effects in double perturbations [2].
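The Pearson delta metric is straightforward to compute. A minimal sketch (vector names are illustrative; expression vectors are per-gene values for one condition):

```python
import numpy as np

def pearson_delta(pred: np.ndarray, obs: np.ndarray,
                  control: np.ndarray) -> float:
    """Pearson correlation between predicted and observed differential
    expression (perturbed minus control), computed across genes."""
    return float(np.corrcoef(pred - control, obs - control)[0, 1])

# Toy check: a prediction whose deltas are a scaled copy of the observed
# deltas correlates perfectly.
d = np.array([1.0, -2.0, 0.5, 3.0])
score = pearson_delta(2 * d, d, np.zeros_like(d))
```

Because it correlates *changes* rather than raw expression, this metric is less flattered by models that simply reproduce the control profile.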

Benchmarking Experimental Workflow

The following diagram illustrates the standard workflow for benchmarking scFMs on perturbation prediction tasks.

[Workflow diagram] Perturbation datasets (Norman et al., CRISPRa in K562; Adamson et al., CRISPRi in K562; Replogle et al., genome-wide CRISPRi) → input (perturbation + unperturbed cell) → models evaluated (foundation models: scGPT, Geneformer, scFoundation; baselines: additive, mean, random forest) → predicted expression profile, compared against the observed profile via L2 distance, Pearson delta, and genetic-interaction prediction accuracy.

Performance Comparison: scFMs vs. Baselines

Recent comprehensive benchmarks have yielded surprising results regarding the performance of scFMs compared to simpler methods for perturbation prediction and other tasks.

Perturbation Effect Prediction Accuracy

The table below summarizes quantitative results from multiple studies comparing scFMs against baseline methods for predicting perturbation effects.

Table 2: Perturbation Prediction Performance Across Datasets (Pearson Delta Metric)

| Model | Norman Dataset | Adamson Dataset | Replogle K562 | Replogle RPE1 | Genetic Interaction AUC |
|---|---|---|---|---|---|
| scGPT | 0.554 | 0.641 | 0.327 | 0.596 | 0.62 |
| Geneformer* | 0.521 | 0.588 | 0.305 | 0.562 | 0.59 |
| scFoundation | 0.459 | 0.552 | 0.269 | 0.471 | 0.55 |
| Additive baseline | 0.670 | 0.712 | 0.425 | 0.665 | N/A |
| Train-mean baseline | 0.557 | 0.711 | 0.373 | 0.628 | 0.64 |
| Random forest + GO | 0.586 | 0.739 | 0.480 | 0.648 | 0.71 |

Note: Geneformer repurposed with linear decoder; results marked with * indicate models not specifically designed for this task. Data synthesized from [2] [10].

The results reveal a consistent pattern: deliberately simple baselines frequently match or exceed the performance of sophisticated foundation models. The additive model (summing individual logarithmic fold changes for double perturbations) and simple mean-based predictors demonstrate particularly strong performance, while tree-based models with biological prior knowledge (like Gene Ontology features) achieve the highest accuracy [2] [10].

Zero-Shot Performance on Fundamental Tasks

Beyond perturbation prediction, studies evaluating zero-shot performance—using pre-trained models without any task-specific fine-tuning—reveal further limitations:

  • Cell Type Annotation: In clustering cells by type, both scGPT and Geneformer underperform established methods like scVI and Harmony, and are sometimes outperformed by simple Highly Variable Genes (HVG) selection [9] [11].
  • Batch Integration: For removing technical batch effects while preserving biological variation, scFMs show inconsistent results, with Geneformer particularly struggling to effectively integrate batches [9].
  • Gene Expression Imputation: Assessing the core pre-training task of predicting masked gene expression reveals that scFMs often learn simplistic patterns, such as predicting median expression values rather than context-specific expression [11].

Analysis of Performance Gaps and Limitations

The consistent underperformance of scFMs relative to simpler methods stems from several fundamental challenges in model design and training.

Key Limitations in Current scFMs

  • Ineffective Knowledge Transfer: The biological knowledge captured during large-scale pre-training does not appear to transfer effectively to the specific task of perturbation prediction. As one study notes, "pretraining on the single-cell atlas data provided only a small benefit over random embeddings" [2].

  • Architectural Misalignment: Transformer architectures, designed for natural language, may not be optimally suited for representing biological systems. The quadratic computational complexity of self-attention also limits scalability to full transcriptomes [7].

  • Simplistic Pre-training Objectives: Models trained primarily on masked gene prediction may learn to impute housekeeping genes but fail to capture deeper regulatory relationships necessary for predicting perturbation effects [11].

  • Data Quality and Variance Issues: Benchmark datasets often exhibit low perturbation-specific variance relative to technical noise, making it difficult to train and evaluate models effectively [10].

Emerging Alternative Approaches

Promising alternatives are emerging to address these limitations:

  • Hybrid Models: Combining foundation model embeddings with simpler predictive models (e.g., using scGPT's gene embeddings in Random Forests) sometimes improves performance over using either approach alone [10].
  • Specialized Architectures: New models like GeneMamba replace transformers with state space models, offering linear computational complexity and improved long-range dependency capture [7].
  • Classical Methods: Gaussian process-based approaches like GPerturb provide competitive prediction accuracy with greater interpretability and inherent uncertainty quantification [12].
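The hybrid idea of using scFM gene embeddings as features for a simpler learner can be sketched with scikit-learn. The embedding matrix and per-perturbation response below are synthetic stand-ins, not the pipeline from the cited study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-ins: one embedding vector per perturbed gene (e.g. rows of an scFM's
# gene-embedding matrix) and a scalar summary of each perturbation's effect.
gene_embeddings = rng.normal(size=(60, 16))       # 60 perturbations x 16 dims
mean_shift = gene_embeddings[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

# Fit a random forest on 50 perturbations; predict 10 held-out ones.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(gene_embeddings[:50], mean_shift[:50])
pred = rf.predict(gene_embeddings[50:])
```

The appeal of this design is that the expensive pre-trained model contributes only features, while a cheap, interpretable learner does the task-specific fitting.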

The Scientist's Toolkit: Essential Research Reagents

The table below catalogues key computational tools and datasets essential for conducting research in single-cell perturbation modeling.

Table 3: Essential Research Reagents for scFM and Perturbation Modeling

| Resource Name | Type | Primary Function | Relevance to scFM Research |
|---|---|---|---|
| Perturb-seq datasets | Data | Provides ground-truth perturbation responses | Gold-standard benchmarks for model evaluation [2] [10] |
| CELLxGENE | Data Platform | Curated single-cell data repository | Source of diverse pre-training and evaluation data [9] |
| BioLLM | Software Framework | Unified interface for diverse scFMs | Standardizes model access and evaluation [8] |
| PertEval-scFM | Benchmarking Framework | Standardized evaluation for perturbation prediction | Enables fair model comparison [3] [13] |
| Gene Ontology annotations | Knowledge Base | Functional gene relationships | Provides biological prior knowledge for feature engineering [10] |
| GPerturb | Modeling Tool | Gaussian process-based perturbation modeling | Interpretable alternative to deep learning approaches [12] |

The current generation of single-cell foundation models represents a significant technical achievement in processing large-scale biological data, yet rigorous benchmarking reveals they have not yet fulfilled their promise for perturbation effect prediction. The consistent finding that simpler models frequently outperform sophisticated scFMs underscores the immaturity of this field and highlights the need for more biologically-grounded architectures and training approaches [2] [9] [10].

For researchers and drug development professionals, practical implications include:

  • Method Selection: Carefully evaluate scFMs against simple baselines for specific applications rather than assuming superior performance.
  • Task-Specific Modeling: Choose models based on task requirements—foundation models may excel for some applications while simpler methods work better for perturbation prediction.
  • Hybrid Approaches: Consider leveraging scFM embeddings as features in simpler, more interpretable models.

Future development should focus on creating more biologically plausible architectures, improving pre-training objectives to capture causal relationships, and developing higher-quality benchmarking datasets with greater perturbation effect sizes. As the field matures, the integration of multi-omic data and explicit biological knowledge may help bridge the current performance gap, potentially realizing the transformative potential of foundation models for therapeutic discovery.

The emergence of single-cell foundation models (scFMs) has generated significant excitement in computational biology, promising a unified framework to decipher the complex language of cellular processes. Trained on millions of single-cell transcriptomes using transformer architectures inspired by natural language processing, these models theoretically learn fundamental biological principles that can be adapted to various downstream tasks [14]. Among the most anticipated applications is perturbation effect prediction—the ability to forecast how genetic interventions will alter cellular states, a capability with profound implications for drug discovery and functional genomics. However, as investment in these complex models grows, rigorous independent benchmarking has revealed a sobering reality: the promise often exceeds current performance, making critical evaluation non-negotiable for guiding future research and clinical applications [3] [2] [6].

Recent comprehensive benchmark studies have systematically evaluated whether scFMs actually enhance our ability to predict perturbation effects compared to simpler approaches. The consistent finding across multiple independent investigations is that zero-shot scFM embeddings do not provide consistent improvements over deliberately simple baseline models, particularly when predicting strong or atypical perturbation effects or under distribution shift [3] [2]. This revelation underscores the critical importance of standardized evaluation frameworks in an era of increasingly complex AI models for biological discovery.

Experimental Benchmarks: Scrutinizing scFM Performance

Benchmarking Frameworks and Performance Metrics

Independent research teams have developed standardized frameworks to evaluate scFMs for perturbation prediction. PertEval-scFM provides a standardized evaluation framework specifically designed for assessing perturbation effect prediction capabilities [3]. Similarly, a comprehensive benchmark published in Nature Methods compared five foundation models and two other deep learning models against deliberately simple baselines for predicting transcriptome changes after single or double perturbations [2]. These studies employed multiple quantitative metrics to ensure robust assessment, including L2 distance between predicted and observed expression values for highly expressed genes, Pearson delta correlation measures, and specialized metrics for genetic interaction prediction [2]. Additional benchmarking efforts have introduced biology-informed evaluation perspectives, such as the scGraph-OntoRWR metric, which measures the consistency of cell type relationships captured by scFMs with prior biological knowledge [6].

The Double Perturbation Prediction Challenge

A critical test for perturbation prediction models involves forecasting expression changes after dual-gene perturbations. Using the Norman et al. dataset where 100 individual genes and 124 pairs of genes were upregulated in K562 cells, researchers fine-tuned models on all single perturbations and 62 double perturbations, then assessed prediction error on the remaining 62 double perturbations [2]. The results revealed that all deep learning models had substantially higher prediction error (L2 distance for the 1,000 most highly expressed genes) compared to a simple additive baseline that sums individual logarithmic fold changes without using double perturbation data [2]. Furthermore, when predicting genetic interactions—defined as double perturbation phenotypes that differ surprisingly from additive expectations—no model outperformed the "no change" baseline that always predicts control condition expression [2].

Table 1: Performance Comparison in Double Perturbation Prediction

| Model Category | Example Models | Prediction Error (L2 Distance) | Genetic Interaction Prediction |
|---|---|---|---|
| Simple baselines | Additive model, no-change model | Lower | Additive not competitive; no-change sets the baseline performance |
| Specialized DL models | GEARS, CPA | Higher | Not better than baseline |
| Single-cell foundation models | scGPT, scFoundation, Geneformer | Higher | Not better than baseline |

Unseen Perturbation Prediction and Embedding Utility

A claimed advantage of foundation models is their potential to predict effects of completely unseen perturbations using knowledge learned during pretraining. To benchmark this capability, researchers used CRISPR interference datasets from Replogle et al. (K562 and RPE1 cells) and Adamson et al. (K562 cells) [2]. They compared specialized models against simple linear models and an even simpler mean prediction baseline. The findings demonstrated that none of the deep learning models consistently outperformed the mean prediction or linear model [2]. When researchers extracted gene embedding matrices from scFoundation and scGPT and used them in the simple linear model framework, these embeddings performed as well as or better than the original models with their built-in decoders, but did not consistently outperform linear models using embeddings derived directly from training data [2].

Table 2: Unseen Perturbation Prediction Performance

| Model Approach | Pearson Correlation with Observed Expression | Consistency Across Cell Lines |
|---|---|---|
| Mean prediction baseline | Competitive | High |
| Simple linear model | Competitive | High |
| Foundation models (scGPT, scFoundation) | Not consistently better than baselines | Variable |
| Linear model with scFM embeddings | Comparable to full scFMs | Variable |

Experimental Protocols: Methodologies for Rigorous Evaluation

Standardized Benchmarking Workflow

The benchmarking process follows a standardized workflow to ensure fair comparison across models. The initial phase involves data preparation and partitioning, using publicly available perturbation datasets such as Norman et al. (for double perturbations) or Replogle et al. and Adamson et al. (for unseen perturbation prediction) [2]. For double perturbation experiments, standard practice involves using all single perturbations and a randomly selected half of double perturbations for training, with the remaining double perturbations held out for testing [2]. The next stage involves model fine-tuning and inference, where each model is fine-tuned on the training data according to its recommended settings, then used to generate predictions for the test conditions [2]. Finally, performance quantification calculates metrics like L2 distance or Pearson correlation between predicted and observed expression values, with multiple runs using different random partitions to ensure robustness [2].

[Workflow diagram] Data preparation and partitioning → model fine-tuning and inference → performance quantification → benchmarking results and analysis.

Figure 1: Standardized Benchmarking Workflow for scFM Evaluation

Simple Baseline Implementation

The benchmarking studies deliberately include straightforward baselines to contextualize scFM performance. The "no change" model always predicts the same expression as in the control condition, providing a minimal performance threshold [2]. The "additive" model calculates the sum of individual logarithmic fold changes for each gene in a double perturbation without using any double perturbation data for training [2]. For unseen perturbation prediction, a simple linear model represents each read-out gene with a K-dimensional vector and each perturbation with an L-dimensional vector, finding the optimal mapping through least-squares regression [2]. An even simpler mean prediction baseline predicts the average expression across training perturbations for all test conditions [2]. These baselines establish performance expectations that any specialized model should reasonably exceed.
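The bilinear structure of that linear model can be written out explicitly: with a K-dimensional embedding per read-out gene and an L-dimensional embedding per perturbation, the observed log fold changes factor as `gene_emb @ W @ pert_emb.T`, and the mapping `W` is recovered by least squares. A noiseless NumPy sketch (dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
G, P, K, L = 100, 40, 8, 8            # genes, perturbations, embedding dims

gene_emb = rng.normal(size=(G, K))    # K-dim vector per read-out gene
pert_emb = rng.normal(size=(P, L))    # L-dim vector per perturbation
W_true = rng.normal(size=(K, L))
Y = gene_emb @ W_true @ pert_emb.T    # observed log fold changes (G x P)

# Least squares for W: vec(Y) = (pert_emb kron gene_emb) vec(W) in
# column-major order, so the fit is an ordinary lstsq problem.
A = np.kron(pert_emb, gene_emb)                       # (G*P, K*L)
w, *_ = np.linalg.lstsq(A, Y.reshape(-1, order="F"), rcond=None)
W_hat = w.reshape(K, L, order="F")
```

With noiseless synthetic data the true mapping is recovered exactly; on real data the residual of this fit is what the deep models would need to beat.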

Key Reagents and Computational Tools

Table 3: Essential Research Reagents and Computational Tools for scFM Benchmarking

Resource Name | Type | Function in Research
Norman et al. Dataset | Experimental Data | Provides single and double perturbation data for K562 cells for benchmark validation [2]
Replogle et al. Dataset | Experimental Data | CRISPRi perturbation data in K562 and RPE1 cells for unseen perturbation tests [2]
Adamson et al. Dataset | Experimental Data | Additional perturbation data in K562 cells for benchmark diversity [2]
PertEval-scFM | Software Framework | Standardized evaluation framework for perturbation prediction models [3]
GPerturb | Computational Method | Gaussian process-based model providing competitive alternative to scFMs [12]

Interpretation of Results: Why Simplicity Sometimes Wins

The consistent underperformance of complex scFMs relative to simple baselines demands explanation. Analysis reveals that for many genes, predictions from scGPT, UCE, and scBERT showed minimal variation across different perturbations, resembling the "no change" baseline [2]. Meanwhile, GEARS and scFoundation predictions varied considerably less than the ground truth observations [2]. This suggests that current scFMs may be struggling with representation learning—the core promise of foundation models—failing to capture the nuanced relationships between genes necessary for accurate perturbation prediction [2].

The surprising competitive performance of simple linear models and mean predictions indicates that current scFMs may not be effectively leveraging their pretraining on large single-cell atlases for this specific task [2]. In contrast, pretraining directly on perturbation data—rather than general single-cell atlas data—provided more substantial benefits for prediction accuracy [2]. This suggests that the biological principles necessary for perturbation prediction may not be efficiently transferred from general scFM pretraining, or that the models are prioritizing other aspects of cellular representation during pretraining.

[Diagram: Complex Foundation Model (high parameter count) → Effective Data Representation, labeled "may learn irrelevant features"; Simple Baseline (linear/additive model) → Effective Data Representation, labeled "focuses on core signal"; Effective Data Representation → Perturbation Effect Prediction]

Figure 2: Performance Paradox: Why Simpler Models Can Compete with Complex scFMs

The comprehensive benchmarking of single-cell foundation models for perturbation prediction reveals a critical juncture in the field. While scFMs represent a theoretically promising approach for modeling cellular behavior, current-generation models do not consistently outperform simpler, more interpretable baselines for predicting perturbation effects [3] [2]. This performance gap highlights the immaturity of scFM technology for this specific application and underscores the non-negotiable importance of rigorous benchmarking in directing methodological development.

Future progress will likely require specialized models trained specifically on perturbation data rather than general single-cell atlases, alongside continued development of high-quality datasets capturing a broader range of cellular states [3] [2]. The benchmarking efforts themselves must also evolve, incorporating more biologically meaningful metrics such as scGraph-OntoRWR, which assesses consistency with prior biological knowledge [6]. As the field advances, the relationship between model complexity and practical utility must be continually reevaluated, with benchmarking serving as the essential compass guiding development toward models that genuinely enhance our ability to predict and understand cellular responses to perturbation.

The ambitious goal of predicting a cell's transcriptional response to genetic perturbation using single-cell foundation models (scFMs) represents a potential frontier in computational biology, with profound implications for rare disease modeling and therapeutic development. The core thesis of current evaluation research, however, reveals a surprising consensus: despite their complexity and computational cost, modern scFMs have not yet consistently surpassed deliberately simple linear baselines in predicting perturbation effects. This guide provides an objective, data-driven comparison of the performance of prominent scFMs against a suite of simpler alternative models, synthesizing evidence from recent rigorous benchmarks to inform researchers and drug development professionals.

Performance Comparison Tables

Table 1: Summary of Model Performance on Key Prediction Tasks

Model | Model Class | Double Perturbation Prediction (L2 Error, Norman et al. data) | Unseen Single Gene Perturbation (Avg. Pearson r) | Key Strengths | Key Limitations
Additive Baseline | Simple Baseline | Lowest Error [2] | N/A | Simple, interpretable, fast | Cannot predict genetic interactions
No Change Baseline | Simple Baseline | Higher than Additive [2] | 0.977 (Replogle K562) [2] | Very simple, stable | Biased towards control state
Linear Model | Simple Baseline | N/A | 0.979 (Replogle K562) [2] | Simple, can extrapolate | Limited non-linear capacity
scGPT | Foundation Model | Higher than Baselines [2] | 0.974 (Replogle K562) [2] | Flexible architecture | High compute, underperforms baselines
Geneformer | Foundation Model | Higher than Baselines [2] | N/A | Context-aware embeddings | Not designed for perturbation prediction
GEARS | Deep Learning | Higher than Baselines [2] | 0.969 (Replogle K562) [2] | Incorporates gene graphs | Complex, poor on unseen perturbations
GPerturb | Gaussian Process | N/A | 0.981 (Gaussian, Replogle) [12] | Uncertainty estimates, interpretable | Less scalable to huge cell counts
CPA | Deep Learning | Not Competitive [2] | 0.984 (mlp, Replogle) [12] | Handles dose-response | Not for unseen perturbations [2]

Genetic Interaction Prediction Performance

Table 2: Performance on Predicting Genetic Interactions (e.g., Synergy, Buffering)

Model | Buffering Interactions (Recall) | Synergistic Interactions (Recall) | Opposite Interactions (Recall) | Overall Accuracy
Additive Baseline | 0% (by design) | 0% (by design) | 0% (by design) | N/A
No Change Baseline | Low [2] | 0% (by design) [2] | Low [2] | Not better than random [2]
scGPT | Low [2] | Very Rare [2] | Very Rare [2] | Not better than random [2]
GEARS | Low [2] | Very Rare [2] | Very Rare [2] | Not better than random [2]
scFoundation | Low [2] | Very Rare [2] | Very Rare [2] | Not better than random [2]

Experimental Protocols & Methodologies

Standardized Benchmarking Framework

The pivotal findings presented here stem from standardized benchmarking frameworks like PertEval-scFM, designed to ensure a fair and rigorous comparison between complex scFMs and simple baselines [3]. The core experimental protocol for the double perturbation prediction benchmark, as applied to the Norman et al. dataset, is as follows [2]:

  • Data Preparation: A dataset of 100 single-gene perturbations and 124 paired double-gene perturbations in K562 cells using CRISPR activation is used. The phenotype is log-transformed RNA-seq expression values for ~19,264 genes.
  • Training-Test Split: The model is fine-tuned on all 100 single perturbations and a randomly selected 62 of the 124 double perturbations (50%). This process is repeated five times with different random splits for robustness.
  • Evaluation: Model performance is assessed on the remaining 62 held-out double perturbations. The primary metric is the L2 distance between predicted and observed expression values for the top 1,000 most highly expressed genes.
  • Baseline Comparison: The performance of all models is compared against two simple baselines:
    • The "No Change" Model: Always predicts the same expression as the control, unperturbed condition.
    • The "Additive" Model: For a double perturbation A+B, predicts the sum of the individual logarithmic fold changes (LFCs) of perturbation A and perturbation B. Crucially, this baseline does not use any double perturbation data for training.
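The split-and-score loop described above can be sketched in a few lines. This is a hedged illustration with synthetic matrices: the 50% split and the restriction to the most highly expressed genes follow the protocol, but all names and toy data are ours, and a real run would fine-tune a model on the training split before scoring.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_half_split(n_doubles, rng):
    """Randomly assign half of the double perturbations to training,
    holding out the other half for testing."""
    perm = rng.permutation(n_doubles)
    half = n_doubles // 2
    return perm[:half], perm[half:]

def l2_top_genes(pred, obs, mean_expression, k=1000):
    """L2 distance between predicted and observed expression, restricted
    to the k most highly expressed genes."""
    top = np.argsort(mean_expression)[::-1][:k]
    return float(np.linalg.norm(pred[top] - obs[top]))

# Toy run: 124 double perturbations, 2000 genes, five random repeats
n_doubles, n_genes = 124, 2000
mean_expr = rng.random(n_genes)
obs = rng.normal(size=n_genes)
pred = obs + rng.normal(scale=0.1, size=n_genes)  # stand-in model output
errors = []
for _ in range(5):
    train_idx, test_idx = random_half_split(n_doubles, rng)
    # (a real benchmark would fine-tune on train_idx here)
    errors.append(l2_top_genes(pred, obs, mean_expr, k=1000))
```

Repeating the split five times, as the protocol specifies, gives an error distribution rather than a single point estimate.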

Unseen Perturbation Prediction Protocol

The benchmark for predicting the effects of completely unseen single-gene perturbations employs datasets from Replogle et al. (K562 and RPE1 cells) and Adamson et al. (K562 cells) [2]. The protocol is:

  • Models are trained on a set of perturbations and then tasked with predicting the transcriptomic outcome for a set of entirely held-out perturbations.
  • A simple linear baseline is constructed by representing each read-out gene and each perturbation with low-dimensional vectors (embeddings). A linear mapping (matrix W) is learned to minimize the difference between predicted and observed expression in the training data.
  • An even simpler "mean prediction" baseline is also included, which always predicts the average expression across all training perturbations.
  • The predictive performance is measured using the Pearson correlation between the predicted and observed expression levels for the held-out perturbations.
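The linear baseline above admits a closed-form least-squares fit. The sketch below assumes the model takes the form Y ≈ P M Gᵀ, with K-dimensional gene embeddings G and L-dimensional perturbation embeddings P; the two-sided pseudoinverse solution and all names are illustrative, not the benchmark's exact implementation.

```python
import numpy as np

def fit_linear_baseline(G, P_train, Y_train):
    """Closed-form least-squares fit of M in Y ≈ P @ M @ G.T.
    G: (n_genes, K) gene embeddings; P_train: (n_train, L) perturbation
    embeddings; Y_train: (n_train, n_genes) observed expression."""
    return np.linalg.pinv(P_train) @ Y_train @ np.linalg.pinv(G.T)

def predict_linear(G, M, P_test):
    """Predict expression for unseen perturbations from their embeddings."""
    return P_test @ M @ G.T

# Toy data where the true mapping is exactly recoverable
rng = np.random.default_rng(1)
G = rng.normal(size=(50, 4))        # 50 genes, K = 4
P = rng.normal(size=(30, 3))        # 30 perturbations, L = 3
M_true = rng.normal(size=(3, 4))
Y = P @ M_true @ G.T
M_hat = fit_linear_baseline(G, P[:20], Y[:20])   # fit on 20 perturbations
pred = predict_linear(G, M_hat, P[20:])          # predict the 10 held out
```

Because the model is linear in the perturbation embedding, it can extrapolate to perturbations never seen in training, provided an embedding exists for them; this is exactly the extrapolation the scFMs are supposed to (but largely fail to) improve upon.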

Model Architectures and Workflows

Benchmarking Experimental Workflow

Figure 1: Standardized Benchmarking Workflow. The process involves splitting perturbation data, training diverse model classes, and evaluating performance to generate a final ranking.

Architectural Comparison

[Diagram: Input (Perturbation & Cell State) feeds three paths to Output (Predicted Expression): the scFM foundation model (e.g., scGPT) via fine-tuning on perturbation data; the Linear Baseline via a learned linear mapping from embeddings; and GPerturb via sparse, additive regression with Gaussian processes]

Figure 2: Model Architecture Paradigms. Contrasts the complex fine-tuning of scFMs with the simpler linear mapping and the interpretable, sparse structure of GPerturb.

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Research Reagents and Computational Tools for Perturbation Modeling

Reagent / Tool | Function / Description | Example Use in Benchmarking
CRISPR Activation (CRISPRa) | Targeted genetic perturbation to upregulate gene expression. | Generating single and double perturbation data in K562 cells for model training and testing [2].
Perturb-seq / CROP-seq | Single-cell RNA sequencing combined with CRISPR screening. | Provides high-throughput, single-cell resolution readouts of perturbation effects [12].
K562 Cell Line | A chronic myelogenous leukemia cell line. | A standard, widely used model system for perturbation screens (e.g., in the Norman and Replogle datasets) [2].
RPE1 Cell Line | A retinal pigment epithelial cell line. | Used to test model generalizability across different cellular contexts [2].
Linear Regression Model | A simple statistical model for predicting a continuous outcome. | Serves as a powerful and hard-to-beat baseline for predicting perturbation effects [2].
Gaussian Process (GP) Regression | A non-parametric Bayesian modeling technique. | Used in GPerturb to provide uncertainty estimates alongside predictions [12].
Gene Ontology (GO) Annotations | A structured knowledge base of gene functions. | Used by some models (e.g., GEARS) to inform relationships between genes for predicting unseen perturbations [2].

Inside the Models: Architectures, Baselines, and the Closed-Loop Breakthrough

The prediction of perturbation effects, a cornerstone of functional genomics and therapeutic development, demands computational models capable of interpreting the complex, interlinked nature of biological systems. In this domain, model architectures are not merely technical choices but fundamental determinants of what biological phenomena can be captured. The evaluation of these models, particularly single-cell foundation models (scFMs), reveals critical trade-offs between their ability to generalize to unseen perturbations and their susceptibility to confounding by systematic variation.

This guide provides a structured comparison of predominant architectural families—from various autoencoder formulations to sophisticated graph networks—deconstructing their performance, experimental protocols, and implementation requirements within perturbation effect prediction research. As benchmark studies reveal, even sophisticated models often perform comparably to simple baselines that capture average treatment effects when evaluated using conventional metrics, highlighting the critical need for rigorous, bias-aware evaluation frameworks like Systema [15]. Understanding these architectural nuances is essential for researchers and drug development professionals selecting appropriate models for specific perturbation prediction tasks.

Architectural Taxonomy and Performance Benchmarking

Quantitative Performance Comparison of Model Architectures

Table 1: Performance comparison of key architecture families across benchmark tasks.

Architecture Family | Representative Model(s) | Primary Application Domain | Key Performance Metrics | Reported Performance | Notable Strengths | Key Limitations
Graph Convolutional Networks | PLGNN [16], GCN [17] | Node classification, Graph classification | Accuracy | Avg. 2.6% improvement on node classification; Avg. 2.1% on graph classification vs. SOTA [16] | Adaptive feature aggregation, Robustness to missing information | Limited higher-level semantic extraction in shallow implementations
Autoencoder-Graph Hybrids | scCAGN [18], DDGAE [19] | scRNA-seq clustering, Drug-target interaction prediction | Normalized Mutual Information (NMI), AUC, AUPR | NMI: 0.9732 (QS_diaphragm) [18]; AUC: 0.9600, AUPR: 0.6621 [19] | Dynamic feature fusion, Superior representation learning | Computational complexity, Integration challenges
Stacked Autoencoders with Optimization | optSAE+HSAPSO [20] | Drug classification, Target identification | Accuracy, Computational efficiency | Accuracy: 95.52%; Time: 0.010s/sample [20] | High predictive accuracy, Rapid processing | Dependent on training data quality, Hyperparameter sensitivity
Simple Baselines | Perturbed Mean, Matching Mean [15] | Perturbation response prediction | PearsonΔ, PearsonΔ20 | Comparable or superior to SOTA methods across 10 datasets [15] | Computational simplicity, Resistance to overfitting | Limited capture of perturbation-specific effects
Dynamic Weighting Graph Networks | DWR-GCN (within DDGAE) [19] | Drug-target interaction prediction | AUC, AUPR | Enhances representation capability without over-smoothing [19] | Increased network depth, Mitigated over-smoothing | Implementation complexity

Perturbation-Specific Predictive Performance

Table 2: Performance on perturbation response prediction tasks (adapted from Systema benchmarking [15]).

Model Architecture | Adamson Dataset (PearsonΔ) | Norman Dataset (PearsonΔ) | Replogle RPE1 Dataset (PearsonΔ) | Generalization to Unseen Perturbations
Perturbed Mean Baseline | High | High | High | Limited to average effects
Matching Mean Baseline | Not applicable | Highest | Not applicable | Good for combinatorial perturbations
CPA | Moderate | Moderate | Moderate | Limited by design
GEARS | Moderate | Moderate-High | Moderate | Moderate for one-gene perturbations
scGPT | Moderate | Moderate | Moderate | Varies by dataset

Experimental Protocols and Methodologies

Standardized Evaluation Frameworks for Perturbation Prediction

The Systema framework establishes rigorous protocols for evaluating perturbation response prediction methods, emphasizing the need to control for systematic variation—consistent differences between perturbed and control cells arising from selection biases or biological confounders [15]. Standard metrics like Pearson correlation of expression changes (PearsonΔ) and Pearson correlation of top 20 differentially expressed genes (PearsonΔ20) are susceptible to these biases, potentially overestimating model performance.

Proper experimental evaluation should include:

  • Systematic Variation Quantification: Measure the degree of systematic differences between perturbed and control cells across datasets, which can be influenced by factors like targeted biological processes or cell cycle distribution shifts [15].
  • Perturbation-Specific Effect Isolation: Focus evaluation on a model's ability to reconstruct the perturbation landscape and identify functionally coherent gene groups, rather than merely capturing average treatment effects.
  • Combinatorial Perturbation Testing: Assess generalization to unseen multi-gene perturbations, where performance typically improves with the number of matching one-gene perturbations seen during training [15].
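The second point above can be made concrete: subtracting the shared mean response from each perturbation's profile separates the systematic component (which a mean-style baseline captures "for free") from perturbation-specific residuals. The sketch below is a simplified proxy for this decomposition, not Systema's actual diagnostics, which are richer.

```python
import numpy as np

def decompose_effects(responses):
    """Split mean expression changes per perturbation into a shared
    (systematic) component and perturbation-specific residuals.
    responses: (n_perturbations, n_genes) expression changes vs. control."""
    systematic = responses.mean(axis=0)   # captured by mean-style baselines
    specific = responses - systematic     # what a model must actually predict
    return systematic, specific
```

Any evaluation run on `responses` directly rewards models for reproducing `systematic`; evaluating on `specific` instead asks whether a model distinguishes one perturbation from another.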

Model-Specific Training Methodologies

Graph Neural Networks with Adaptive Perturbation (PLGNN)

The PLGNN framework employs two key strategies to address missing information in graph data:

  • High-way Links Strategy: Constructs connections using higher-order neighbors sharing the same class label to augment the graph and enable more efficient feature aggregation from distant nodes [16].
  • Adaptive Feature Perturbation: Adds noise to real features followed by adaptive training to bolster model robustness and reduce overfitting risks associated with missing information [16].

The training minimizes a combined loss function incorporating both supervised classification objectives and regularization terms from the feature perturbation process.

Hybrid Adversarial Autoencoder-Graph Networks (scCAGN)

The scCAGN methodology for single-cell RNA sequencing clustering integrates three components through a joint training mechanism:

  • Adversarial Autoencoder (AAE): Comprises an encoder, decoder, and discriminator. The encoder and decoder minimize the reconstruction loss \(L_{\mathrm{res}} = \frac{1}{2N}\sum_{i=1}^{N}\|\bar{X} - \hat{X}\|_{F}^{2}\), while the discriminator guides the encoder to align the latent space with a prior distribution (e.g., a standard Gaussian) through an adversarial loss [18].
  • Graph Convolutional Network (GCN): Processes cell similarity graphs constructed using K-nearest neighbors to capture topological relationships between cells via the layer-wise propagation rule \(Z^{(l)} = \phi(\tilde{D}^{-\frac{1}{2}}(A+I)\tilde{D}^{-\frac{1}{2}}Z^{(l-1)}W^{(l-1)})\) [18].
  • Dynamic Fusion Mechanism: Integrates representations from AAE and GCN using cross-attention mechanisms, followed by dual-constraint clustering to optimize cell-type annotations [18].
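The GCN propagation rule from the second component can be sketched directly in NumPy. Here ReLU stands in for the activation φ, which the description does not pin down, so that choice is an assumption, as are the toy graph and all names.

```python
import numpy as np

def gcn_layer(A, Z, W):
    """One GCN propagation step: φ(D̃^(-1/2) (A+I) D̃^(-1/2) Z W), where
    A is the cell-similarity adjacency matrix (no self-loops), Z the node
    features, and W the layer's learned weights. ReLU plays the role of φ."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops: A + I
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # symmetric norm
    return np.maximum(A_norm @ Z @ W, 0.0)       # ReLU activation

# Toy graph: 4 cells in a chain, 3 input features, 2 output features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z = np.ones((4, 3))
W = np.ones((3, 2))
H = gcn_layer(A, Z, W)   # propagated, nonnegative node embeddings
```

The symmetric normalization prevents high-degree cells from dominating their neighbors' embeddings, which is the over-smoothing concern the DWR-GCN variant addresses.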

Optimized Stacked Autoencoders with Evolutionary Optimization (optSAE+HSAPSO)

This drug classification framework operates in two phases:

  • Stacked Autoencoder (SAE) Pretraining: Multiple autoencoder layers learn hierarchical feature representations through unsupervised reconstruction pre-training [20].
  • Hierarchically Self-Adaptive Particle Swarm Optimization (HSAPSO): An evolutionary algorithm dynamically adapts hyperparameters during training, optimizing the trade-off between exploration and exploitation without relying on gradient information [20].

The joint optimization enables the model to achieve high accuracy while reducing computational overhead compared to traditional deep learning approaches.

Architectural Visualizations and Workflows

Graph Autoencoder Architecture for Perturbation Modeling

[Diagram: Perturbation Data (Expression/Graph) → Encoder (GCN/LSTM/MLP) → Latent Representation (Z) → Decoder (Inner Product/MLP) → Reconstructed Graph (Predicted Response)]

Graph Autoencoder Architecture: This diagram illustrates the encoder-decoder structure common to many perturbation prediction models, where input data is compressed into a latent representation before reconstruction.

scCAGN Integrated Workflow for Single-Cell Analysis

[Diagram: Raw scRNA-seq Data → Preprocessing (QC, Normalization, HVG Selection), which feeds two parallel paths: Graph Construction (KNN on Euclidean distance) → Graph Convolutional Network (node embedding learning), and the Adversarial Autoencoder (encoder-decoder-discriminator); both converge at Dynamic Fusion (cross-attention mechanism) → Deep Dual-Constraint Clustering → Cell Type Annotations]

scCAGN Integrated Workflow: This workflow shows the parallel processing of single-cell data through both autoencoder and graph network pathways before dynamic fusion and clustering.

Systematic Variation in Perturbation Datasets

[Diagram: Genetic Perturbation → Perturbation-Specific Effects → Observed Transcriptional Response; in parallel, confounding factors (Cell Cycle Distribution, Batch Effects, Selection Bias in Panel) → Systematic Effects → Observed Transcriptional Response]

Systematic Variation Sources: This diagram decomposes observed transcriptional responses into perturbation-specific effects and systematic confounders that can bias model evaluation.

Essential Research Reagents and Computational Tools

Table 3: Key research reagents and computational tools for perturbation modeling experiments.

Resource Category | Specific Resource | Application Context | Key Functionality
Benchmark Datasets | Adamson (2016) [15], Norman (2019) [15], Replogle (2022) [15] | Perturbation response prediction | Provide standardized benchmarking across technologies and cell lines
Evaluation Frameworks | Systema [15] | Model evaluation | Quantifies systematic variation and enables bias-aware performance assessment
Biological Networks | DrugBank [19], HPRD [19], STRING | Drug-target interaction, Protein-protein interaction | Source of prior biological knowledge for network-based models
Molecular Descriptors | Molecular fingerprints [21], Chemical Checker signatures [21] | Drug representation in synergy prediction | Encodes chemical structure information for machine learning
Graph Learning Libraries | PyTorch Geometric [22] | GNN implementation | Provides efficient graph neural network operations and pre-processing
Single-Cell Analysis Tools | Seurat [18], Scanpy | scRNA-seq preprocessing | Quality control, normalization, and feature selection for single-cell data

A transformative shift is underway in computational biology, where the prediction of cellular responses to genetic perturbations is foundational for understanding disease mechanisms and identifying therapeutic targets. The advent of single-cell foundation models (scFMs), pretrained on millions of single-cell transcriptomes, promised to learn general principles of cellular biology that could accurately predict transcriptional outcomes of unseen genetic perturbations. However, recent rigorous benchmarking studies reveal a surprising counter-narrative: deliberately simple baseline models, including linear models and the "perturbed mean" approach, consistently match or surpass the performance of these complex deep-learning architectures [2] [23] [15]. This comparison guide synthesizes evidence from multiple systematic evaluations to objectively assess the performance landscape of perturbation prediction methods, providing researchers with evidence-based guidance for method selection and highlighting critical considerations for robust model evaluation.

Performance Benchmarking: Quantitative Comparisons Across Methods

Recent comprehensive benchmarks across multiple datasets and cell lines demonstrate that simple baselines achieve competitive performance compared to state-of-the-art foundation models.

Table 1: Performance Comparison of Perturbation Prediction Methods (PearsonΔ Metric)

Method | Adamson Dataset | Norman Dataset | Replogle K562 | Replogle RPE1
Perturbed Mean | 0.711 | 0.557 | 0.373 | 0.628
scGPT | 0.641 | 0.554 | 0.327 | 0.596
scFoundation | 0.552 | 0.459 | 0.269 | 0.471
Random Forest + GO Features | 0.739 | 0.586 | 0.480 | 0.648

Source: Adapted from benchmarking results [23]

The perturbed mean baseline, which simply predicts the average expression across all perturbed cells in the training data, consistently outperforms both scGPT and scFoundation across all datasets [23] [15]. Similarly, for predicting combinatorial perturbation effects in the Norman dataset, the matching mean baseline (averaging the centroids of the individual perturbations) outperformed the best specialized deep learning model, GEARS, by an 11% margin [15].
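Both baselines are trivial to implement, which underlines the point. The sketch below uses illustrative names and toy data (generic gene labels, two-gene profiles); it assumes the matching mean averages the training centroids of the matching single-gene perturbations.

```python
import numpy as np

def perturbed_mean(train_profiles):
    """Predict, for every test perturbation, the average expression over
    all perturbed cells/conditions in the training set."""
    return train_profiles.mean(axis=0)

def matching_mean(single_centroids, gene_pair):
    """For a combinatorial perturbation A+B, average the training centroids
    of the matching single-gene perturbations A and B."""
    return np.mean([single_centroids[g] for g in gene_pair], axis=0)

# Toy example: two single-perturbation centroids over 2 genes
centroids = {"GENE_A": np.array([1.0, 0.2]),
             "GENE_B": np.array([0.4, 0.8])}
pred_ab = matching_mean(centroids, ("GENE_A", "GENE_B"))
```

Neither function has trainable parameters, so neither can overfit; any deep model that fails to beat them is, in effect, failing to learn anything perturbation-specific.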

Performance on Genetic Interaction Prediction

The benchmark extended to evaluating models' ability to predict genetic interactions—instances where simultaneous perturbations produce unexpected effects compared to individual perturbations. Using data where 100 individual genes and 124 pairs of genes were upregulated in K562 cells, researchers assessed how well models could predict these non-additive effects [2]. Surprisingly, none of the foundation models (scGPT, scFoundation, GEARS, CPA) outperformed the simplistic "no change" baseline that always predicts expression identical to control conditions [2]. All models predominantly predicted buffering interactions and rarely correctly identified synergistic interactions, revealing a significant limitation in current approaches for capturing complex genetic interplay.

Experimental Protocols and Benchmarking Methodologies

Benchmarking Framework Design

The consistent underperformance of complex models across studies raises critical questions about evaluation methodologies. Key benchmarking frameworks include:

  • PertEval-scFM: A standardized framework for evaluating zero-shot single-cell foundation model embeddings against baseline models, specifically designed to assess whether contextualized representations enhance perturbation effect prediction [13] [3].

  • Systema: An evaluation framework that emphasizes perturbation-specific effects and identifies predictions that correctly reconstruct the perturbation landscape, specifically addressing systematic variation biases [15].

These frameworks employ rigorous cross-validation strategies, typically fine-tuning models on a subset of perturbations and assessing prediction error on held-out perturbations across multiple random partitions to ensure robustness [2].

Evaluation Metrics and Their Limitations

Benchmarks employ multiple metrics to comprehensively assess model performance:

  • Pearson Delta: Correlation between predicted and observed differential expression profiles [23] [15]
  • Pearson Delta20: Focused evaluation on top 20 differentially expressed genes [23]
  • L2 Distance: Measurement of expression prediction error [2]
  • Genetic Interaction Prediction: Ability to identify non-additive perturbation effects [2]
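The first two metrics reduce to a few lines each. The sketch below assumes PearsonΔ compares expression changes relative to the control mean, and that PearsonΔ20 selects the 20 genes with the largest observed absolute change; the exact selection rule varies across papers, so treat both definitions as illustrative.

```python
import numpy as np

def pearson_delta(pred, obs, control_mean):
    """Pearson correlation between predicted and observed expression
    changes relative to control (PearsonΔ)."""
    return float(np.corrcoef(pred - control_mean, obs - control_mean)[0, 1])

def pearson_delta20(pred, obs, control_mean):
    """PearsonΔ restricted to the top 20 differentially expressed genes,
    here ranked by observed absolute change (PearsonΔ20)."""
    delta_obs = obs - control_mean
    top = np.argsort(np.abs(delta_obs))[::-1][:20]
    return float(np.corrcoef(pred[top] - control_mean[top], delta_obs[top])[0, 1])
```

Because both metrics correlate *changes*, any shared shift between perturbed and control populations (the systematic variation discussed below) inflates them for models that merely reproduce the average effect.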

Recent research reveals that standard metrics are susceptible to systematic variation—consistent transcriptional differences between perturbed and control cells arising from selection biases or confounders [15]. This systematic variation can lead to overestimated performance for methods that primarily capture average perturbation effects rather than perturbation-specific biology.

[Diagram: Selection Biases and Biological Confounders → Systematic Variation → Overestimated Performance and Metric Limitation]

Diagram 1: Systematic Variation in Perturbation Datasets Affects Benchmarking

The Scientist's Toolkit: Essential Research Reagents and Solutions

Table 2: Key Research Reagents and Experimental Resources

Resource | Type | Function in Perturbation Studies
CRISPR Activation (CRISPRa) | Perturbation Technology | Gene overexpression in perturbation screens [2]
CRISPR Interference (CRISPRi) | Perturbation Technology | Gene knockdown in perturbation screens [23]
Perturb-seq | Screening Technology | Combines CRISPR perturbations with single-cell sequencing [23]
Gene Ontology (GO) Annotations | Biological Database | Provides functional gene annotations for feature engineering [2] [23]
Adamson Dataset | Experimental Data | CRISPRi perturbation data with 68,603 single cells [23]
Norman Dataset | Experimental Data | Single/double CRISPRa perturbations with 91,205 single cells [2] [23]
Replogle Dataset | Experimental Data | Genome-wide CRISPRi screen with ~162,750 cells per cell line [23]

Understanding the Performance Gap: Systematic Variation and Its Implications

The Systematic Variation Challenge

The surprisingly strong performance of simple baselines can be largely explained by systematic variation in perturbation datasets—consistent differences between perturbed and control cells that arise from selection biases, confounding variables, or underlying biological factors [15]. For example:

  • In the Norman dataset, perturbations target genes involved in specific biological processes (cell cycle and growth), creating structured variation that simple means can capture [15].
  • In the Replogle RPE1 dataset, significant differences in cell-cycle distribution exist between perturbed and control cells (46% of perturbed cells vs. 25% of control cells in G1 phase) due to widespread chromosomal instability-induced cell-cycle arrest [15].
  • Gene set enrichment analysis reveals consistent pathway activation differences (e.g., cellular death activation, stress response downregulation) between perturbed and control populations across datasets [15].

[Diagram: Experimental Design and Biological Response both feed Systematic Effects; Systematic Effects → Simple Baselines Succeed, and → Complex Models Struggle (fail to capture perturbation-specific effects)]

Diagram 2: Why Simple Baselines Succeed in Current Benchmarks

Limitations of Current Foundation Models

Beyond systematic variation issues, several inherent limitations contribute to the underperformance of foundation models:

  • Ineffective representation learning: When pretrained gene embeddings from scGPT and scFoundation were used in simple random forest models instead of their native architectures, performance improved, suggesting the foundation models' pretraining captures useful information but their complex decoders fail to utilize it effectively [23].
  • Limited generalizability: Models struggle with predicting strong or atypical perturbation effects and perform particularly poorly under distribution shift [13] [3].
  • Computational inefficiency: Significant computational expenses for fine-tuning deep learning models yield no performance benefits over simpler approaches [2].

The consistent evidence across multiple rigorous benchmarks indicates that current deep-learning-based foundation models for perturbation effect prediction do not yet provide substantial advantages over deliberately simple linear baselines and mean-based approaches. This conclusion holds across diverse experimental datasets, perturbation types, and evaluation metrics [2] [23] [15].

For researchers and drug development professionals, these findings suggest:

  • Method selection should prioritize simple baselines as competitive starting points for perturbation prediction tasks.
  • Evaluation rigor must increase, with attention to systematic variation and use of frameworks like Systema that disentangle true predictive performance from dataset biases.
  • Future development should focus on creating more diverse perturbation datasets that capture broader cellular states and specialized architectures that better leverage biological prior knowledge.

The field stands to benefit from increased focus on performance metrics and benchmarking standards that will facilitate genuine progress toward the goal of generalizable predictive models in computational biology [2]. As benchmarking methodologies become more sophisticated and datasets more comprehensive, the true potential of both simple and complex approaches can be properly assessed and harnessed for biological discovery and therapeutic development.

Predicting how individual cells respond to genetic or chemical perturbations represents a fundamental challenge in computational biology with significant implications for understanding disease mechanisms and therapeutic development [24]. The emergence of single-cell RNA sequencing (scRNA-seq) and CRISPR screening technologies has generated unprecedented volumes of high-resolution data, creating both opportunities and challenges for computational method development [24]. In this landscape, two competing approaches have emerged: complex deep learning models, including single-cell foundation models (scFMs), and simpler, often classically-inspired statistical methods.

Recent benchmarking studies have revealed a surprising trend: sophisticated models often fail to outperform simple baselines. The PertEval-scFM benchmark demonstrated that zero-shot scFM embeddings provide no consistent improvement over simpler baseline models, particularly under distribution shift [3] [13] [25]. Similarly, an independent 2025 benchmarking study found that even the simplest baseline model—taking the mean of training examples—outperformed foundation models scGPT and scFoundation [23]. These findings highlight a critical need for interpretable, robust methods that can genuinely capture biological mechanisms rather than merely learning systematic biases in training data.

Within this context, GPerturb emerges as a novel Gaussian process-based approach that balances predictive performance with interpretability and uncertainty quantification [24]. This case study examines GPerturb's methodological framework, benchmarking results, and practical utility for researchers and drug development professionals.

GPerturb's Methodological Framework: A Gaussian Process Approach

Core Architecture and Theoretical Foundations

GPerturb is a hierarchical Bayesian model designed specifically for estimating sparse, interpretable gene-level perturbation effects from single-cell CRISPR screening data [24]. Unlike "black box" deep learning approaches, GPerturb employs a transparent generative modeling structure that separates biological signal from technical noise through two distinct components:

  • Basal Expression Component: A feature-specific basal expression level determined by cell-specific parameters (e.g., cell type, sequencing information)
  • Perturbation Effect Component: A feature-specific perturbation effect dependent on the perturbation type, controlled by a binary on/off switch to enforce sparsity [24]

The model employs Gaussian processes (GPs) to capture nonlinear relationships mapping cell-specific parameters and perturbation types to observed expression levels [24]. This nonparametric Bayesian approach provides natural uncertainty estimates for both the presence and strength of perturbation effects on individual genes, a critical feature for reliable biological discovery.

[Diagram: Single-cell CRISPR data enters the GPerturb model. Cell-specific parameters determine the basal expression component; perturbation information drives the perturbation effect component, which is regularized by sparsity constraints. Both components carry Gaussian process priors and combine to produce the gene expression output.]

Figure 1: GPerturb's architectural framework separates basal expression from perturbation effects using Gaussian process priors and sparsity constraints.
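The two-component decomposition described above can be illustrated with a toy generative sketch. All names and dimensions here are invented for illustration; the actual model replaces the linear basal map below with Gaussian processes and infers the sparsity switch from data rather than sampling it:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_perts = 200, 50, 5

# Cell-specific covariates (e.g., encoded cell type, sequencing depth)
covariates = rng.normal(size=(n_cells, 3))
pert_labels = rng.integers(0, n_perts, size=n_cells)  # which perturbation each cell received

# Basal component: maps cell covariates to expression (linear here; a GP in GPerturb)
basal = covariates @ rng.normal(size=(3, n_genes))

# Perturbation component: gene-level effects gated by a binary on/off switch (sparsity)
switch = rng.random((n_perts, n_genes)) < 0.1         # ~10% of genes affected per perturbation
effects = rng.normal(scale=2.0, size=(n_perts, n_genes)) * switch

# Observed expression = basal + sparse perturbation effect + noise
expression = basal + effects[pert_labels] + rng.normal(scale=0.5, size=(n_cells, n_genes))
```

The on/off switch is what makes the estimated effects sparse and interpretable: most gene-perturbation pairs contribute exactly zero.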

Implementation Variants and Data Compatibility

A key practical advantage of GPerturb is its flexibility in handling different data types, which are common points of friction in single-cell analysis:

  • GPerturb-Gaussian: Designed for continuous transformed expression measurements
  • GPerturb-ZIP: Employs a zero-inflated Poisson model for raw count data [24]

This dual formulation allows researchers to apply GPerturb regardless of their preprocessing pipeline, eliminating the need for potentially distorting data transformations required by other methods.
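The zero-inflated Poisson likelihood underlying GPerturb-ZIP mixes a point mass at zero with a Poisson component. A minimal sketch of its log-probability mass function (a textbook formulation, not GPerturb's implementation):

```python
import math

def zip_logpmf(k: int, lam: float, pi: float) -> float:
    """Log-pmf of a zero-inflated Poisson: with probability pi the count is a
    structural zero; otherwise it is drawn from Poisson(lam)."""
    if k == 0:
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    # log of (1 - pi) * lam**k * exp(-lam) / k!
    return math.log(1.0 - pi) - lam + k * math.log(lam) - math.lgamma(k + 1)
```

The zero-inflation term lets the model absorb the excess zeros typical of raw scRNA-seq counts without any prior transformation.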

Comparative Performance Benchmarking

Predictive Accuracy on Single-Gene Perturbations

GPerturb's performance has been rigorously evaluated against leading perturbation prediction methods across multiple datasets. The following table summarizes its performance in predicting single-gene perturbation effects compared to state-of-the-art approaches:

Table 1: Performance comparison on single-gene perturbation prediction from a genome-wide CRISPRi Perturb-seq dataset

| Method | Input Data Type | Pearson Correlation | Key Limitations |
| --- | --- | --- | --- |
| GPerturb-Gaussian | Continuous | 0.981 | Slightly lower than CPA-mlp |
| CPA-mlp | Continuous | 0.984 | Requires categorical cell information |
| GEARS | Continuous | 0.977 | Limited to discrete perturbations |
| GPerturb-ZIP | Count-based | 0.972 | - |
| SAMS-VAE | Count-based | 0.944 | Cannot incorporate cell-level information |

Data adapted from GPerturb benchmark studies [24]

In these head-to-head comparisons, GPerturb demonstrated competitive performance across different data modalities. GPerturb-Gaussian nearly matched the performance of CPA-mlp (0.981 vs. 0.984) while offering superior interpretability, and GPerturb-ZIP substantially outperformed SAMS-VAE on count-based data (0.972 vs. 0.944) [24].

Directionality Agreement and Effect Consistency

Beyond overall correlation, the directionality of predicted perturbation effects (whether a perturbation increases or decreases gene expression) represents a critical metric for biological utility. GPerturb demonstrates notable advantages in this domain:

Table 2: Directionality agreement between methods for perturbation effect predictions

| Comparison Pair | Directionality Agreement | Key Discrepancies |
| --- | --- | --- |
| GPerturb-Gaussian vs. CPA | Moderate | Exosome-related perturbation effects |
| GPerturb-Gaussian vs. GEARS | Moderate | Exosome-related perturbation effects |
| GPerturb-ZIP vs. SAMS-VAE | High | Minimal |
| All methods consensus | Low | Only 21 genes shared across methods |

Data synthesized from benchmark analyses [24]

These discrepancies highlight a concerning lack of consensus in the field, with different methods frequently predicting opposite effects for the same gene-perturbation pairs [24]. GPerturb's higher consistency with SAMS-VAE on count-based data suggests its perturbation effect estimates may be more biologically reliable.

Performance in the Context of Simple Baselines and Systematic Variation

Recent research has revealed that systematic variation—consistent differences between perturbed and control cells arising from selection biases or confounders—can lead to overoptimistic performance assessments [15]. The Systema evaluation framework has demonstrated that simple baselines like "perturbed mean" (averaging expression across all perturbed cells) often match or exceed the performance of sophisticated models including CPA, GEARS, and scGPT [15].

In this challenging evaluation context, GPerturb's competitive performance using a principled statistical framework with inherent uncertainty quantification represents a significant advantage over both deep learning approaches and simplistic baselines.

Experimental Protocols and Validation Frameworks

Standardized Benchmarking Methodology

To ensure fair comparison across perturbation prediction methods, recent benchmarks have adopted standardized evaluation protocols:

  • Data Splitting: 80/20 train-test splits, with strict separation of perturbations to evaluate generalization to unseen perturbations [24]
  • Evaluation Metrics: Pearson correlation between predicted and observed expression levels, calculated both in raw expression space and differential expression space [24] [23]
  • Pseudo-bulk Formation: Single-cell predictions are aggregated to perturbation-level pseudo-bulk profiles for stable comparison [23]
  • Directionality Analysis: Assessment of concordance in sign (up/down-regulation) of effects across methods [24]

The move toward more rigorous benchmarks like PertEval-scFM [3] [13] and Systema [15] addresses longstanding concerns about overoptimistic evaluations in the field.
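Two of the protocol steps above, pseudo-bulk aggregation and correlation in differential-expression space, can be sketched as follows (an illustrative implementation, not the benchmarks' exact code):

```python
import numpy as np

def pseudo_bulk(cells: np.ndarray) -> np.ndarray:
    """Aggregate a (cells x genes) expression matrix to one perturbation-level profile."""
    return cells.mean(axis=0)

def pearson_delta(pred: np.ndarray, obs: np.ndarray, control: np.ndarray) -> float:
    """Pearson correlation in differential-expression space: both profiles are
    expressed relative to the control profile before correlating."""
    return float(np.corrcoef(pred - control, obs - control)[0, 1])
```

Subtracting the control profile matters: correlation in raw expression space rewards any model that reproduces the (largely shared) baseline transcriptome, whereas the delta form isolates the predicted change.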

Uncertainty Quantification in Experimental Design

A distinctive advantage of GPerturb's Bayesian framework is its native uncertainty quantification. The model provides variance estimates for both basal expression levels and perturbation effects, allowing researchers to:

  • Prioritize high-confidence predictions for experimental validation
  • Identify ambiguous predictions requiring additional data
  • Distinguish strong, consistent effects from weak or variable ones [24]

This capability is particularly valuable for designing efficient perturbation screens, as it helps focus experimental resources on the most reliable predictions.
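One simple way to operationalize this prioritization is to rank effects by a z-like score, the posterior mean magnitude divided by the posterior standard deviation. This is an illustrative heuristic, not GPerturb's prescribed procedure:

```python
import numpy as np

def prioritize(effect_mean: np.ndarray, effect_sd: np.ndarray, top_k: int) -> np.ndarray:
    """Rank gene-perturbation effects by |posterior mean| / posterior sd so that
    strong AND consistent effects are validated first."""
    score = np.abs(effect_mean) / effect_sd
    return np.argsort(score)[::-1][:top_k]

# Three candidate effects: large but uncertain, tiny, and small but precise
means = np.array([2.0, 0.1, 1.0])
sds = np.array([1.0, 0.1, 0.1])
ranked = prioritize(means, sds, top_k=2)  # the small-but-precise effect ranks first
```

Note how the third effect outranks the first despite a smaller mean: the ratio rewards consistency, which is exactly the behavior needed for allocating scarce validation capacity.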

Research Reagent Solutions: Practical Implementation Toolkit

Implementing perturbation prediction methods requires specific computational tools and data resources. The following table outlines key components of the GPerturb research toolkit:

Table 3: Essential research reagents and computational tools for perturbation prediction studies

| Resource Category | Specific Examples | Function and Application |
| --- | --- | --- |
| Perturbation Datasets | Adamson (2016), Norman (2019), Replogle (2022) | Benchmark data for training and evaluation [24] [23] [15] |
| Software Frameworks | GPerturb, CPA, GEARS, scGPT | Core prediction algorithms with distinct methodological approaches [24] [23] |
| Evaluation Frameworks | PertEval-scFM, Systema | Standardized benchmarking tools [3] [13] [15] |
| Baseline Methods | Perturbed Mean, Matching Mean | Simple comparators for performance validation [15] |
| Visualization Tools | AUCell, GSEA plots | Biological interpretation of predicted effects [15] |

Signaling Pathways and Biological Workflows

The integration of perturbation prediction into biological discovery follows a structured workflow that combines computational and experimental approaches:

[Diagram: Existing perturbation data → computational model training → in silico perturbation screening → candidate selection → experimental validation → biological insight, with validation results also driving model refinement that feeds back into the screening step (closed-loop learning).]

Figure 2: The iterative closed-loop framework for perturbation discovery, combining computational prediction with experimental validation.

The emerging approach of "closed-loop" perturbation modeling demonstrates how experimental results can be continuously incorporated to refine predictions. Recent work shows that incorporating even small numbers of experimental perturbation examples (10-20) during fine-tuning can dramatically improve prediction accuracy [26]. This iterative approach tripled positive predictive value in T-cell activation studies, from 3% to 9%, while also improving sensitivity and specificity [26].

GPerturb represents a compelling alternative to both complex foundation models and oversimplified baselines in the perturbation prediction landscape. Its Gaussian process framework provides:

  • Competitive predictive performance across multiple data modalities and benchmark datasets
  • Interpretable effect estimates with built-in sparsity constraints
  • Native uncertainty quantification for principled experimental prioritization
  • Flexible implementation accommodating both continuous and count-based data

For researchers and drug development professionals, GPerturb offers a balanced solution that bridges the gap between black-box deep learning models and biologically implausible oversimplifications. As the field moves toward more rigorous evaluation standards that account for systematic biases [15], GPerturb's principled statistical foundation positions it as a valuable tool for therapeutic target discovery and mechanistic studies.

The ongoing development of closed-loop frameworks [26] that incorporate experimental feedback into model refinement points toward a future where computational predictions and experimental validation are tightly integrated, accelerating biological discovery and therapeutic development.

The 'closed-loop' paradigm represents a fundamental shift in scientific methodology, transitioning from traditional linear, open-loop approaches to dynamic, feedback-driven experimentation. In classical open-loop systems, experiments follow a predetermined "stimulate → record response" protocol, treating biological systems as black boxes [27]. In contrast, closed-loop neuroscience and related fields respect the inherent "loopiness" of neural circuits and the fact that the nervous system is embodied and embedded in an environment [27]. This paradigm has become increasingly feasible thanks to advances in real-time processing of large data streams, enabled by improvements in computer processing power, electronics such as microprocessors and field-programmable gate arrays (FPGAs), and specialized software [27].

In the specific context of perturbation effect prediction, this closed-loop approach enables researchers to continuously refine their models based on experimental outcomes, creating an iterative cycle of prediction, experimental validation, and model improvement. This is particularly relevant for single-cell foundation models (scFMs) that aim to predict transcriptional responses to genetic perturbations, where the ultimate goal is to develop models that can accurately forecast the effects of genetic interventions without requiring exhaustive wet-lab experimentation [28] [2].

Benchmarking Single-Cell Foundation Models for Perturbation Prediction

Current Performance Landscape

Recent systematic benchmarking efforts reveal significant limitations in current deep-learning-based approaches for predicting genetic perturbation effects. The PertEval-scFM framework provides a standardized evaluation methodology that assesses whether contextualized representations from single-cell foundation models enhance perturbation effect prediction in a zero-shot setting [28]. Surprisingly, these benchmarks demonstrate that scFM embeddings offer limited improvement over simple baseline models, particularly under distribution shift [28].

A comprehensive study published in Nature Methods compared five foundation models (scGPT, scFoundation, scBERT, Geneformer, UCE) and two other deep learning models (GEARS, CPA) against deliberately simple baselines for predicting transcriptome changes after single or double perturbations [2]. The results were striking: none of the sophisticated deep learning models outperformed the simple baselines, highlighting the importance of critical benchmarking in directing and evaluating method development [2].

Quantitative Performance Comparison

Table 1: Performance Comparison of Perturbation Prediction Models on Double Perturbation Tasks

| Model Type | Model Name | Prediction Error (L2 Distance) | Genetic Interaction Prediction Accuracy | Computational Requirements |
| --- | --- | --- | --- | --- |
| Simple Baselines | Additive Model | Lowest | Limited to additive effects only | Minimal |
| Simple Baselines | No Change Model | Moderate | Cannot predict synergistic interactions | Minimal |
| Foundation Models | scGPT | Higher than baselines | Poor; mostly predicts buffering interactions | Very High |
| Foundation Models | scFoundation | Higher than baselines | Poor; limited variation across perturbations | Very High |
| Foundation Models | GEARS | Higher than baselines | Moderate, but less variable than ground truth | High |
| Other Deep Learning | CPA | Not designed for this task | Not applicable | High |

Table 2: Performance on Unseen Perturbation Prediction

| Model Type | Training Data Strategy | Performance on Unseen Perturbations | Consistency Across Cell Lines |
| --- | --- | --- | --- |
| Linear Model with Pretrained P | Perturbation data from related cell lines | Best performing | More accurate for similar genes between cell lines |
| Foundation Model Embeddings | Single-cell atlas data | Small benefit over random embeddings | Variable |
| Mean Prediction | None (always predicts average) | Moderate | Consistent but inaccurate |
| GEARS | Gene Ontology annotations | Poor | Not consistent |
| scGPT/scFoundation | Model's pretrained embeddings | Poor to moderate | Variable |

The benchmarking analysis revealed that all deep learning models had substantially higher prediction errors compared to the simple additive baseline when predicting double perturbation effects [2]. The evaluation metric used was the L2 distance between predicted and observed expression values for the 1,000 most highly expressed genes. This finding persisted across different summary statistics, including Pearson delta measure and L2 distances for various gene subsets [2].

For genetic interaction prediction, conceptualized as double perturbation phenotypes that differ surprisingly from additive expectations, none of the models outperformed the 'no change' baseline [2]. The models were particularly deficient in predicting synergistic interactions, with most models predominantly predicting buffering interactions and rarely correctly identifying synergistic relationships [2].
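The additive baseline and the interaction readout derived from it can be sketched as follows (illustrative; the benchmark's exact interaction calling at a 5% false discovery rate involves additional statistics):

```python
import numpy as np

def additive_prediction(delta_a: np.ndarray, delta_b: np.ndarray,
                        control: np.ndarray) -> np.ndarray:
    """Additive baseline: the double-perturbation profile expected if the two
    single-perturbation deltas (relative to control) simply sum."""
    return control + delta_a + delta_b

def interaction_signal(observed_ab: np.ndarray, delta_a: np.ndarray,
                       delta_b: np.ndarray, control: np.ndarray) -> np.ndarray:
    """Deviation of the observed double perturbation from the additive expectation.
    Values near zero indicate no interaction; a weaker-than-additive response
    suggests buffering, a stronger-than-additive response suggests synergy."""
    return observed_ab - additive_prediction(delta_a, delta_b, control)
```

Because genetic interactions are defined as deviations from this additive expectation, any model that merely reproduces the additive prediction scores well on raw error yet detects no interactions at all, which is why the 'no change' interaction baseline is so hard to beat.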

Experimental Protocols and Methodologies

Benchmarking Experimental Design

The benchmarking protocols for perturbation effect prediction involve several critical methodological components. For double perturbation assessment, researchers used data where 100 individual genes and 124 pairs of genes were upregulated in K562 cells using a CRISPR activation system [2]. The phenotypes for these 224 perturbations, plus a no-perturbation control, are logarithm-transformed RNA sequencing expression values for 19,264 genes [2].

The standard experimental workflow involves fine-tuning models on all 100 single perturbations and a subset of double perturbations (62 of 124), then assessing prediction error on the remaining held-out double perturbations [2]. For robustness, researchers typically run each analysis multiple times (e.g., five repetitions) using different random partitions of the data [2].

The key evaluation metrics include:

  • L2 distance between predicted and observed expression values
  • Pearson delta measure for correlation assessment
  • Genetic interaction detection capability at specified false discovery rates (typically 5%)
  • Classification accuracy for interaction types (buffering, synergistic, opposite)

For unseen perturbation prediction, benchmarks utilize CRISPR interference datasets from multiple cell lines (K562 and RPE1) [2]. The simple linear baseline model represents each read-out gene with a K-dimensional vector and each perturbation with an L-dimensional vector, with these vectors collected in matrices G and P respectively [2]. The model then solves the optimization problem min_W ‖Y_train − (G W Pᵀ + b)‖₂², where b is the vector of row means of Y_train [2].
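A minimal sketch of this linear baseline on synthetic data, solved in closed form with pseudoinverses. For simplicity b is a known per-gene offset here; in the benchmark it is fixed to the row means of the training matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_perts, K, L = 100, 30, 8, 6

G = rng.normal(size=(n_genes, K))   # K-dimensional vector per read-out gene
P = rng.normal(size=(n_perts, L))   # L-dimensional vector per perturbation
W_true = rng.normal(size=(K, L))
b = rng.normal(size=(n_genes, 1))   # per-gene offset (broadcast over perturbations)

Y_train = G @ W_true @ P.T + b      # noiseless synthetic phenotypes

# Closed-form least-squares solution of min_W || Y_train - (G W P^T + b) ||^2
W_hat = np.linalg.pinv(G) @ (Y_train - b) @ np.linalg.pinv(P.T)
Y_pred = G @ W_hat @ P.T + b        # recovers Y_train exactly in this noiseless setup
```

To score an unseen perturbation, one only needs its representation vector (a new row of P); the fitted W and the gene matrix G are reused, which is what makes this baseline so cheap relative to fine-tuning a foundation model.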

Data Processing and Quality Control

Proper experimental design requires rigorous data quality control procedures. Researchers must:

  • Remove incomplete cases - exclude samples or cells with incomplete measurements [29]
  • Filter test runs - eliminate measurements generated during pilot or preview runs [29]
  • Address missing data - identify and understand causes of missing values in treatment or outcome variables [29]
  • Implement quality checks - exclude data points that fail predefined quality thresholds [29]
  • Handle outliers - flag and exclude measurements beyond specified standard deviation thresholds [29]

Additionally, for perturbation experiments, it's crucial to verify the integrity of randomization through statistical tests like two-sample independent t-tests for continuous variables and Chi-square tests for categorical variables [29].

[Diagram: Phase 1, experimental setup (define perturbation targets, select model systems, establish baseline metrics) → Phase 2, data generation (apply perturbations, measure transcriptomic responses, quality control and filtering) → Phase 3, model training and prediction (train models, generate effect predictions, validate against held-out data) → Phase 4, closed-loop refinement (analyze prediction errors, update model parameters, design follow-up experiments), with iterative refinement looping back to Phase 3.]

Diagram 1: Closed-Loop Experimental Workflow for Perturbation Prediction. This illustrates the iterative process of generating perturbation data, training models, and refining predictions.

Essential Research Reagents and Computational Tools

Laboratory Reagents and Experimental Materials

Table 3: Essential Research Reagents for Perturbation Experiments

| Reagent/Tool | Specification | Experimental Function |
| --- | --- | --- |
| CRISPR Activation System | As used by Norman et al. (e.g., CRISPRa) | Introduction of targeted genetic perturbations in cell lines |
| K562 Cells | Human immortalized myelogenous leukemia line | Primary model system for perturbation studies |
| RPE1 Cells | Human retinal pigment epithelial cell line | Alternative model system for validation |
| RNA Sequencing Reagents | High-throughput sequencing platforms | Transcriptomic profiling of perturbation effects |

Computational and Analytical Tools

The computational toolkit for perturbation prediction research includes both specialized and general-purpose analytical frameworks:

Specialized Benchmarking Frameworks:

  • PertEval-scFM - Standardized framework for evaluating single-cell foundation models for perturbation effect prediction [28]
  • GEARS - Genetic perturbation modeling framework that uses Gene Ontology annotations for extrapolation [2]
  • scGPT/scFoundation - Foundation models trained on single-cell transcriptomics data [2]

Data Analysis Environments:

  • R Statistical Programming with packages including:
    • tidyverse and data.table for data wrangling and reshaping [29]
    • ggplot2 for visualization [29]
    • QuantPsyc for generating statistical results [29]
    • grf for generalized random forests [29]
  • Python with visualization libraries like Matplotlib [30]

Statistical Analysis Tools:

  • t-test formulations for comparing experimental means, with two-sample tests assuming equal or unequal variances [31]
  • F-test for comparing variances between datasets before conducting t-tests [31]
  • Hypothesis testing with standard significance levels (α = 0.05, 0.01, or 0.001) [31]
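The two-sample t statistic (pooled, equal-variance form) and the variance-ratio F statistic listed above can be computed with the Python standard library alone (a textbook sketch, not tied to any specific package mentioned here):

```python
import math
import statistics

def two_sample_t(a, b):
    """Two-sample t statistic assuming equal variances (pooled estimator)."""
    na, nb = len(a), len(b)
    ma, mb = statistics.mean(a), statistics.mean(b)
    # Pooled variance from the two unbiased sample variances
    sp2 = ((na - 1) * statistics.variance(a)
           + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

def f_stat(a, b):
    """F statistic for comparing two sample variances (larger variance on top)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return max(va, vb) / min(va, vb)
```

In practice the F-test result guides which t-test variant to run: if the variances differ significantly, the pooled form above should be replaced with Welch's unequal-variance t-test.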

Conceptual Framework: Understanding Closed-Loop System Architectures

The closed-loop paradigm in neuroscience and perturbation research encompasses several distinct conceptual frameworks, each with specific characteristics and applications:

[Diagram: Open-loop system: experimenter-defined stimulus protocol → biological system → measured response. Simple closed-loop system: brain-state-dependent stimulus → biological system → measured brain state (EEG/fMRI/transcriptomic), which feeds back to shape the next stimulus. Complex closed-loop system: biological system → action/perturbation → environment with internal dynamics → feedback signal → biological system.]

Diagram 2: Architectures of Open and Closed-Loop Experimental Systems. This compares traditional open-loop approaches with two types of closed-loop systems used in neuroscience and perturbation research.

In the context of perturbation prediction, the "brain-state dynamics loop" corresponds to how models are updated based on newly observed transcriptional states, while the "task dynamics loop" represents the broader experimental context where predictions inform subsequent perturbation designs [32]. This conceptual framework is crucial for understanding how closed-loop systems differ from traditional open-loop approaches where the stimulus protocol is predetermined by the experimenter without regard to the system's current state [32].

Future Directions and Methodological Implications

The benchmarking results showing that simple linear models can outperform sophisticated foundation models have significant implications for the field of perturbation prediction [28] [2]. These findings underscore the importance of rigorous benchmarking and the need for specialized models and high-quality datasets that capture a broader range of cellular states [28].

Future methodological developments should focus on:

  • Improved Training Data - Incorporating more diverse perturbation data across multiple cell types and conditions
  • Better Representation Learning - Developing embeddings that more accurately capture gene-gene interactions and pathway relationships
  • Incorporation of Biological Priors - Leveraging existing knowledge about gene networks and functional annotations
  • Iterative Model Refinement - Implementing true closed-loop systems where model predictions directly inform subsequent experimental designs

The closed-loop paradigm represents a promising framework for addressing the current limitations in perturbation effect prediction. By creating iterative cycles of prediction, experimental validation, and model refinement, researchers can gradually improve the accuracy and generalizability of their models, potentially overcoming the current performance plateau where complex models fail to outperform simple baselines [2].

As the field progresses, the integration of more sophisticated closed-loop approaches with increasingly comprehensive experimental data holds the potential to eventually realize the goal of accurate in silico prediction of genetic perturbation effects, which would dramatically accelerate basic research and therapeutic development.

Why Models Stumble: Systematic Variation and Paths to Improvement

In the field of single-cell biology, accurately predicting how cells respond to genetic perturbations is fundamental to advancing functional genomics, drug discovery, and therapeutic development. However, a formidable challenge confounds these efforts: systematic variation. This term refers to consistent, non-biological differences in gene expression profiles that arise from experimental artifacts, selection biases, or confounding biological factors, rather than from the specific perturbation being studied [15]. These biases can range from stress responses induced during tissue dissociation for single-cell analysis to pervasive confounders like cell cycle distribution shifts [33] [15]. When unaccounted for, systematic variation skews data interpretation, leading to overoptimistic performance claims for prediction models and potentially misleading biological conclusions. This guide objectively evaluates the current landscape of single-cell perturbation effect prediction, highlighting how systematic variation confounds model performance and comparing the capabilities of various computational approaches against simple, yet robust, baselines.

Defining Systematic Variation in Single-Cell Data

Systematic variation differs fundamentally from random noise. Random error causes unpredictable fluctuations in measurements that tend to cancel out with large sample sizes, primarily affecting precision. In contrast, systematic error introduces consistent, directional bias that skews all measurements away from the true biological state, directly compromising accuracy [34] [35]. In single-cell perturbation studies, this manifests as structured transcriptional changes that are not specific to the intended perturbation.

The table below categorizes common sources of systematic variation in single-cell genomics:

Table: Common Sources of Systematic Variation in Perturbation Studies

| Source Category | Specific Example | Impact on Data |
| --- | --- | --- |
| Experimental Design | Selection of a perturbation panel targeting biologically related genes (e.g., cell cycle genes) [15] | Introduces consistent transcriptomic differences between perturbed and control cells |
| Sample Preparation | Tissue dissociation protocols triggering cellular stress responses [33] | Induces artificial expression of stress genes (e.g., fos/jun, heat shock genes) |
| Biological Confounders | Underlying biological factors (e.g., cell-cycle phase, chromatin landscape) [15] | Causes widespread shifts, such as cell-cycle arrest in p53-positive cells post-perturbation |
| Measurement Artifacts | Instrument calibration or consistent operator error [34] | Affects all measurements in a consistent direction or proportion |

The core problem is that these systematic effects can be biologically real (e.g., a genuine stress response) but are "systematic in effect," meaning they occur broadly across many perturbations and are not specific to the gene being targeted. This obscures the unique, perturbation-specific signal that models aim to predict [15].

Benchmarking Single-Cell Foundation Models (scFMs)

Performance Comparison Against Simple Baselines

Recent benchmarking studies reveal a sobering reality: state-of-the-art deep-learning models often fail to surpass deliberately simple baselines in predicting transcriptional responses to unseen genetic perturbations [2]. The PertEval-scFM framework and other independent benchmarks have shown that complex foundation models do not provide consistent improvements for this task [3] [15].

The following table summarizes a key benchmark comparing sophisticated models against simple baselines on the task of predicting outcomes for unseen single-gene perturbations:

Table: Benchmarking scFMs vs. Baselines for Unseen Single-Gene Perturbation Prediction

| Model Type | Example Models | Key Finding | Performance Summary vs. Baselines |
| --- | --- | --- | --- |
| Single-Cell Foundation Models (scFMs) | scGPT, scFoundation, Geneformer | Struggled to generalize beyond systematic variation [15] | Did not consistently outperform simple baselines [2] |
| Other Deep Learning Models | GEARS, CPA | Performance was comparable or inferior to nonparametric baselines [15] | Matched or outperformed by simple averages |
| Simple Nonparametric Baselines | "Perturbed Mean", "Matching Mean" | Captured average treatment effects and systematic differences effectively [15] | Performed comparably to or outperformed state-of-the-art methods [15] |

For the more complex task of predicting double-gene perturbations, the "matching mean" baseline—which simply averages the expression profiles of the two corresponding single-gene perturbations—outperformed all other models, including GEARS, by a considerable margin (11% improvement for the PearsonΔ metric) [15]. Furthermore, a linear model using pretrained embeddings sometimes outperformed the very foundation models from which those embeddings were extracted [2].
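The "perturbed mean" and "matching mean" baselines discussed above fit in a few lines. This sketch assumes perturbation-level pseudo-bulk profiles keyed by target gene; the gene names and values are invented for illustration:

```python
import numpy as np

def perturbed_mean(train_profiles: dict) -> np.ndarray:
    """Predict any unseen perturbation as the average over all perturbed training profiles."""
    return np.mean(list(train_profiles.values()), axis=0)

def matching_mean(train_profiles: dict, gene_a: str, gene_b: str) -> np.ndarray:
    """Predict a double perturbation A+B as the mean of the two single-perturbation profiles."""
    return 0.5 * (train_profiles[gene_a] + train_profiles[gene_b])

# Hypothetical pseudo-bulk expression profiles over four genes
profiles = {"TP53": np.array([1.0, 0.0, 2.0, 1.0]),
            "MYC":  np.array([3.0, 1.0, 0.0, 1.0])}
prediction = matching_mean(profiles, "TP53", "MYC")  # averages the two single-gene profiles
```

That methods this trivial remain competitive is precisely the point: much of what sophisticated models appear to predict is the shared, systematic component these averages capture for free.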

[Diagram: An unseen perturbation is passed either to a simple baseline (e.g., 'perturbed mean') or to a single-cell foundation model (e.g., scGPT, GEARS). Both predictions are strongly shaped by systematic variation, while the foundation model only weakly captures the perturbation-specific effect.]

Diagram: Performance of complex scFMs is often confounded by systematic variation, while simple baselines capture it effectively.

The Systema Evaluation Framework

To address the confounding effect of systematic variation, the Systema framework was introduced. It shifts the evaluation paradigm from simply measuring the similarity between predicted and observed expression profiles to assessing a model's ability to reconstruct the true "perturbation landscape" [15].

The core principles of Systema are:

  • Mitigating Systematic Biases: It emphasizes the prediction of perturbation-specific effects by controlling for the average treatment effect that is easily learned by simple baselines.
  • Interpretable Readout: It evaluates whether a model can correctly position unseen perturbations relative to known ones in a meaningful space, thus testing for genuine biological understanding rather than memorization of systematic shifts [15].

When evaluated under this more rigorous framework, the task of generalizing to unseen perturbations proves substantially harder than standard metrics suggest. Systema helps differentiate predictions that merely replicate systematic effects from those that capture biologically informative perturbation responses [15].
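Systema's reference swap can be illustrated in a few lines. In the synthetic example below (all vectors are fabricated for illustration), a model that predicts only the shared systematic shift still scores a high PearsonΔ against a control reference; re-centering on the perturbed-cell centroid exposes that it has learned nothing perturbation-specific.

```python
import numpy as np

def pearson_delta(pred, true, reference):
    """Pearson correlation of predicted vs. observed shifts from a reference."""
    return np.corrcoef(pred - reference, true - reference)[0, 1]

rng = np.random.default_rng(0)
control = np.zeros(50)
systematic = rng.normal(size=50)      # shift shared by all perturbations
specific = 0.3 * rng.normal(size=50)  # true perturbation-specific effect
wrong = 0.3 * rng.normal(size=50)     # model's incorrect specific guess

true_prof = control + systematic + specific
pred_prof = control + systematic + wrong   # model captures only the shared shift

naive = pearson_delta(pred_prof, true_prof, control)    # looks impressive
centroid = control + systematic  # approximates the perturbed-cell centroid
strict = pearson_delta(pred_prof, true_prof, centroid)  # near zero
```

The gap between `naive` and `strict` is a rough stand-in for the score drops reported when Systema's perturbed-centroid reference replaces the control reference.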

Key Experimental Protocols and Workflows

The scSLAM-seq Method for Measuring Dissociation Artifacts

A critical source of systematic variation is the cellular stress response triggered during tissue dissociation for single-cell RNA sequencing. The scSLAM-seq (single-cell thiol(SH)-linked alkylation for the metabolic sequencing of RNA) protocol was developed to directly measure and correct for this artifact [33].

Experimental Workflow:

  • 4sU Incorporation: The uridine analog 4-thiouridine (4sU) is added to the dissociation medium. As cells transcribe RNA during the dissociation process, 4sU is incorporated into newly synthesized transcripts.
  • Cell Dissociation: Standard tissue dissociation procedures are carried out in the presence of 4sU.
  • Thiol-Alkylation: After cell fixation, cells are treated with the thiol-reactive alkylating agent iodoacetamide. This step chemically modifies the 4sU-containing RNAs.
  • Library Prep & Sequencing: Single-cell libraries are prepared (e.g., using the 10x Genomics Chromium system) and sequenced.
  • Bioinformatic Identification: Transcripts synthesized during dissociation are identified during data analysis by characteristic T-to-C substitutions introduced by the chemical conversion step [33].

[Flowchart: tissue sample → 4sU added to dissociation medium → cell dissociation (triggers stress response) → new transcripts incorporate the 4sU label → thiol modification with iodoacetamide → scRNA-seq library preparation → sequencing and bioinformatic identification via T>C conversions → labeled transcripts mark the dissociation artifact; unlabeled transcripts reflect the true cell state.]

Diagram: The scSLAM-seq workflow labels and identifies dissociation-induced transcripts.

This methodology has demonstrated that dissociation can induce general stress response genes (e.g., fos/jun) as well as cell-type-specific response programs. It also reveals significant sample-to-sample variation in dissociation response, even under controlled conditions, highlighting a potential source of batch effects [33].

Quantifying Systematic Variation in Perturbation Datasets

To objectively assess the degree of systematic variation in a given perturbation dataset, the following analytical protocol is recommended:

  • Gene Set Enrichment Analysis (GSEA): Perform GSEA between all perturbed cells and all control cells (not per perturbation). The enrichment of pathways not specifically targeted by the perturbation panel (e.g., "response to chemical stress," "unfolded protein response") is a strong indicator of systematic variation [15].
  • AUCell Scoring: Use AUCell to score the activity of the enriched pathways from step 1 in single cells. Visualize the distribution of scores to confirm consistent differences between the perturbed and control populations [15].
  • Cell Cycle Analysis: Assign each cell to a cell cycle phase based on canonical markers. A significant disparity in the distribution of cells across phases between perturbed and control groups (e.g., an overabundance of perturbed cells in G1 phase due to cell-cycle arrest) quantifies a major biological confounder [15].
  • Systematic Variation Metric: The collective evidence from the steps above, particularly the effect sizes of the enriched pathways and cell cycle distribution shifts, provides a measure of the systematic variation present. Datasets with high systematic variation will show strong, consistent transcriptional differences between the bulk perturbed and control groups, independent of the specific perturbation [15].
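As a minimal illustration of the cell cycle step, the G1-phase imbalance can be checked with a two-proportion z-test. The counts below are hypothetical, chosen to mimic the 46% vs. 25% G1 split reported for the Replogle RPE1 dataset.

```python
import numpy as np

# Hypothetical cell-cycle phase counts (G1 / S / G2-M) for
# perturbed vs. control populations.
perturbed = {"G1": 4600, "S": 3000, "G2M": 2400}
control = {"G1": 2500, "S": 4000, "G2M": 3500}

n1, n2 = sum(perturbed.values()), sum(control.values())
p1, p2 = perturbed["G1"] / n1, control["G1"] / n2

# Two-proportion z-test (normal approximation) for the G1 fraction.
pooled = (perturbed["G1"] + control["G1"]) / (n1 + n2)
z = (p1 - p2) / np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
# |z| >> 2 flags a cell-cycle distribution shift, i.e. a systematic
# confounder that simple baselines can exploit.
```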

Table: Key Reagent Solutions for Perturbation and Artifact Analysis

| Reagent / Resource | Function | Example Use Case |
| --- | --- | --- |
| 4-thiouridine (4sU) | Ribonucleoside analog for metabolic labeling of newly transcribed RNA | Labeling transcripts produced during tissue dissociation in scSLAM-seq to identify dissociation artifacts [33] |
| Iodoacetamide (IAA) | Thiol-reactive alkylating agent | Chemical conversion of 4sU-labeled RNAs in the scSLAM-seq protocol to introduce T-to-C mutations for bioinformatic identification [33] |
| CRISPR activation/interference libraries | High-throughput tools for targeted genetic perturbation | Introducing single- or double-gene perturbations in cell lines (e.g., K562, RPE1) to generate benchmark datasets for model evaluation [15] [2] |
| Systema framework | Computational evaluation framework (GitHub) | Benchmarking perturbation prediction models while controlling for systematic variation to assess true biological learning [15] |
| PertEval-scFM | Standardized benchmarking framework | Evaluating zero-shot capabilities of single-cell foundation models for perturbation effect prediction [3] |

Visualizing Data and Networks to Uncover Variation

Effective visualization is crucial for interpreting complex biological networks and identifying potential systematic biases.

  • Rule 1: Determine the Figure's Purpose: Before creating a visualization, define the specific message. Is it about network functionality or structure? This determines the choice between directed arrows (for data flow) and undirected edges (for topology) [36].
  • Rule 2: Consider Alternative Layouts: While node-link diagrams are common, adjacency matrices are often superior for dense networks. They minimize clutter, effectively encode edge attributes with color, and make node labels readable [36].
  • Rule 3: Beware of Unintended Spatial Interpretations: In node-link diagrams, viewers will naturally interpret nodes in proximity as functionally related. Using force-directed or multidimensional scaling layouts that accurately reflect a chosen similarity measure (e.g., connectivity strength, functional similarity) is essential to avoid misinterpretation [36].
  • Rule 4: Provide Readable Labels and Captions: Labels must be legible at publication size. If the layout prevents this, an online high-resolution, zoomable version of the network should be provided [36].

The pervasive challenge of systematic variation necessitates a paradigm shift in how we develop and evaluate single-cell perturbation models. Current evidence indicates that sophisticated single-cell foundation models often do not outperform simple baselines that primarily capture these systematic biases, particularly when predicting the effects of unseen perturbations. The path forward requires a concerted focus on robust experimental design, such as using scSLAM-seq to dissect artifacts, and the adoption of rigorous evaluation frameworks like Systema that explicitly control for non-specific effects. The ultimate goal is to build models that genuinely understand perturbation biology, moving beyond the confounding shadows cast by cell cycle, stress responses, and other sources of systematic variation.

In the rigorous field of single-cell perturbation effect prediction, where models forecast how genetic or chemical perturbations alter cellular states, the choice of evaluation metrics is paramount. Root Mean Square Error (RMSE) and a derivative metric, PearsonΔ (the Pearson correlation of predicted versus actual expression changes), have become standard tools for benchmarking model performance. However, a growing body of evidence suggests that an over-reliance on these metrics can paint a misleading picture of a model's true biological predictive power, potentially directing therapeutic discovery down unproductive paths. This guide objectively compares the performance of various computational models, framed within the broader thesis that standard evaluation protocols in single-cell foundation model (scFM) research require a critical reassessment.

The Allure and Pitfalls of Standard Metrics

To understand their limitations, one must first understand what RMSE and PearsonΔ measure.

  • RMSE quantifies the average magnitude of prediction error, giving higher weight to larger errors due to the squaring operation [37] [38]. It is optimal when prediction errors follow a normal (Gaussian) distribution [39].
  • PearsonΔ assesses the linear correlation between the predicted and actual differential expression profiles, measuring how well the predictions rank the direction of changes but not their absolute accuracy [15].
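Both metrics are a few lines of NumPy. The toy example (synthetic values, purely illustrative) exposes a known blind spot: a prediction that doubles every true change still earns a perfect PearsonΔ, even as its RMSE degrades.

```python
import numpy as np

def rmse(pred, truth):
    """Root mean square error: penalizes large errors quadratically."""
    return np.sqrt(np.mean((pred - truth) ** 2))

def pearson_delta(pred, truth, control):
    """PearsonΔ: correlation of predicted vs. observed expression *changes*
    relative to control; sensitive to direction, blind to magnitude."""
    return np.corrcoef(pred - control, truth - control)[0, 1]

control = np.zeros(4)
truth = np.array([1.0, -0.5, 0.2, 0.0])
pred = control + 2 * (truth - control)  # doubles every change

r = pearson_delta(pred, truth, control)  # 1.0: direction is perfect
err = rmse(pred, truth)                  # ~0.57: magnitude is not
```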

Their widespread adoption is driven by intuitive interpretation and standardization across fields [37]. However, they possess critical weaknesses in the context of complex biological perturbation data:

  • Sensitivity to Outliers: RMSE can be disproportionately inflated by a small number of large errors, which may not reflect overall model performance [37] [38].
  • Susceptibility to Systematic Variation: Both metrics can be misleadingly optimistic when the dataset contains systematic differences between control and perturbed cells. These differences can arise from selection biases (e.g., perturbing a panel of functionally related genes) or confounding factors (e.g., widespread cell-cycle arrest) [15]. A model can achieve a high PearsonΔ or low RMSE simply by learning these consistent background signals rather than the specific biological effects of a novel perturbation.

Experimental Evidence: When Simple Baselines Outperform Complex Models

Benchmarking studies have revealed the startling performance of simple baseline models against sophisticated single-cell foundation models (scFMs), highlighting the deceptive nature of standard metrics.

Quantitative Performance Comparison

The table below summarizes findings from a benchmark across ten single-cell perturbation datasets, comparing state-of-the-art methods against simple baselines. The "Perturbed Mean" baseline predicts the average expression across all perturbed cells, while the "Matching Mean" for combinatorial perturbations averages the profiles of the constituent single-gene perturbations [15].

| Model Type | Model Name | Key Methodology | Reported Performance (PearsonΔ) | Performance Summary vs. Baselines |
| --- | --- | --- | --- | --- |
| Simple baseline | Perturbed Mean | Predicts average expression of all perturbed cells [15] | N/A | Outperformed or matched scFMs on unseen 1-gene perturbations across all datasets [15] |
| Simple baseline | Matching Mean | For combo perturbations, averages centroids of constituent genes [15] | N/A | Outperformed scFMs on unseen 2-gene perturbations by ~11% (PearsonΔ) [15] |
| scFM | scGPT | Transformer model pre-trained on single-cell data [6] | High variability across tasks and datasets | Did not provide consistent improvements; performance highly task-dependent [3] [6] |
| Specialized model | GEARS | Leverages graph neural networks and prior knowledge of gene networks [15] | Comparable to baselines on some datasets [15] | Outperformed by Matching Mean baseline on combinatorial perturbations [15] |
| Specialized model | CPA (Compositional Perturbation Autoencoder) | Uses disentanglement to separate basal cellular state from perturbation effect [40] | N/A | Performance comparable to simpler baselines [40] |

Detailed Experimental Protocols

The insights in the table above are primarily derived from a rigorous benchmarking protocol:

  • Dataset Curation: Models are evaluated on multiple public single-cell perturbation datasets (e.g., Adamson et al., Norman et al., Replogle et al.) that include genetic perturbations across different cell lines [15].
  • Task Definition: The core task is unseen perturbation prediction, where models must predict the transcriptional outcomes of genetic perturbations not present in the training data [15] [40].
  • Model Training & Evaluation:
    • State-of-the-art models (e.g., scGPT, GEARS, CPA) and simple baselines (Perturbed Mean, Matching Mean) are trained on the same data splits [15].
    • Predictions are evaluated by comparing the predicted differential expression profile (perturbed vs. control) to the ground-truth profile using metrics like RMSE and PearsonΔ [15].
  • Analysis of Systematic Variation: Researchers perform Gene Set Enrichment Analysis (GSEA) and analyze cell-cycle phase distribution to identify and quantify non-specific systematic differences between control and perturbed cell populations that could confound the metrics [15].

The following diagram illustrates the workflow and the central problem of this evaluation paradigm: standard metrics can be gamed by models that learn systematic variation.

[Flowchart: single-cell perturbation datasets carrying systematic variation (e.g., stress response, cell cycle) feed model training; PearsonΔ and RMSE computed on predictions for unseen perturbations capture these systematic effects, so performance appears high, supporting the misleading conclusion that the model understands perturbation-specific biology.]

A Path Forward: Beyond RMSE and PearsonΔ

Recognizing the limitations of these metrics is the first step. The next is adopting more robust evaluation frameworks and metrics.

The Systema Framework

Introduced to address these specific issues, Systema is an evaluation framework that emphasizes perturbation-specific effects [15]. Its methodology includes:

  • Quantifying Systematic Variation: Measuring the consistent, non-specific differences between control and perturbed cells in a dataset.
  • Focusing on Perturbation Landscapes: Evaluating how well a model's predictions reconstruct the unique relationships between different perturbations, rather than just matching an average profile.

Alternative and Complementary Metrics

Researchers are increasingly turning to a suite of other metrics to gain a more holistic view of model performance:

  • Rank-based Metrics: These evaluate a model's ability to correctly order perturbations by the strength of a specific effect (e.g., which gene knockout most reverses a disease signature). This is often more relevant for designing follow-up experiments than a low RMSE [40].
  • Mean Absolute Error (MAE): Unlike RMSE, MAE is less sensitive to large errors, providing a more robust measure of typical error magnitude [39] [38].
  • Distributional Metrics: Metrics like the Energy Distance or Maximum Mean Discrepancy (MMD) assess whether the full distribution of predicted cellular states matches the real distribution, going beyond just the average effect [40].
  • Biology-Informed Metrics: Novel metrics like scGraph-OntoRWR check if the relationships between cell types learned by the model are consistent with established biological knowledge from cell ontologies [6].
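A sample-based energy distance is straightforward to sketch. The version below uses the plain V-statistic estimator (it includes the zero diagonal, which is fine for illustration) on synthetic data.

```python
import numpy as np

def energy_distance(X, Y):
    """Sample energy distance between two sets of cell profiles
    (rows = cells, columns = genes); near zero iff distributions match."""
    def mean_pairwise(A, B):
        # Mean Euclidean distance over all cross pairs.
        return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).mean()
    return 2 * mean_pairwise(X, Y) - mean_pairwise(X, X) - mean_pairwise(Y, Y)

rng = np.random.default_rng(1)
ref = rng.normal(size=(200, 5))
same = energy_distance(ref, rng.normal(size=(200, 5)))         # near zero
shifted = energy_distance(ref, rng.normal(size=(200, 5)) + 1)  # clearly larger
```

Unlike a comparison of mean profiles, this statistic also reacts to differences in spread and shape of the predicted cell-state distribution.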

For researchers conducting or evaluating perturbation prediction benchmarks, the following tools and datasets are essential.

| Name | Type | Primary Function |
| --- | --- | --- |
| Systema [15] | Evaluation framework | A framework to evaluate models on perturbation-specific effects, mitigating the influence of systematic variation |
| PerturBench [40] | Benchmarking codebase | A modular platform for model development and evaluation across diverse perturbation tasks and datasets |
| Adamson, Norman, Replogle datasets [15] | Benchmarking data | Key public single-cell perturbation screening datasets used for training and evaluating models |
| Gene Set Enrichment Analysis (GSEA) [15] | Analytical method | Identifies enriched biological pathways; used to diagnose the presence of systematic variation |
| Rank correlation metrics [40] | Evaluation metric | Measure agreement in the ordering of perturbations, crucial for in silico screening priorities |

The reliance on RMSE and PearsonΔ as primary metrics for evaluating perturbation prediction models is a precarious practice. As robust benchmarking studies have shown, these metrics can be gamed by systematic biases in the data, leading to the illusion of competence in models that are merely recapitulating background noise. This misdirection can have real-world consequences, wasting computational and experimental resources on models that fail to generalize. The path forward requires a shift towards more sophisticated, biology-aware evaluation frameworks like Systema and a commitment to multi-metric assessments that include rank-based and distributional metrics. By looking beyond standard metrics, the field can better select models that truly unravel the complexities of cellular perturbation biology.

Understanding how genetic perturbations affect single cells is crucial for advancing functional genomics, with wide-ranging implications for revealing gene functions, mapping regulatory networks, and accelerating therapeutic discovery [15]. The space of possible genetic perturbations is combinatorially complex, making exhaustive experimental exploration infeasible. To address this challenge, computational approaches have been developed to predict transcriptional outcomes of genetic perturbations that were never experimentally tested [15]. However, despite strong performance reported for these methods, their ability to infer the effects of truly novel perturbations remains an open question in the field [41].

Recent studies have revealed a critical methodological concern: current evaluation approaches may overestimate model performance due to systematic variation—consistent transcriptional differences between perturbed and control cells arising from selection biases or biological confounders [41] [15]. This systematic variation can lead to misleading conclusions about a model's true predictive capabilities for unseen perturbations. The Systema framework, introduced in a 2025 Nature Biotechnology publication, addresses this fundamental challenge by providing a more rigorous standard for evaluating perturbation response prediction methods [41] [15].

What is Systema? Purpose and Design Principles

Systema is an evaluation framework specifically designed to emphasize perturbation-specific effects and identify predictions that correctly reconstruct the perturbation landscape [41]. It was developed in response to the finding that existing metrics are susceptible to systematic biases, which can lead to overestimated performance [15]. The framework moves beyond traditional reference-based evaluation approaches that use control cells as the sole point of comparison [42].

The core innovation of Systema is its ability to mitigate systematic biases by focusing on perturbation-specific effects while providing an interpretable readout of methods' ability to reconstruct the perturbation landscape [42]. Instead of using control cells as a point of reference, Systema enables the use of custom references that better isolate perturbation-specific effects, including the centroid of perturbed cells [42]. This approach results in substantially lower but more realistic evaluation scores that better reflect true generalization capability to novel perturbations [42].

Quantifying Systematic Variation: A Foundational Insight

A key contribution leading to Systema's development was the quantification of systematic variation across perturbation datasets. Researchers computed the distribution of cosine similarities between perturbation-specific shifts and the average perturbation effect [42]. High cosine similarity indicates that transcriptional responses to different perturbations are aligned in a similar direction, suggesting shared, possibly non-specific shifts in gene expression [42].

This analysis revealed that the amount of systematic variation in perturbation datasets strongly correlated with the performance scores of existing perturbation response prediction methods [42]. In essence, models were achieving high scores primarily by capturing these systematic differences rather than genuine perturbation-specific effects.
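This diagnostic can be sketched directly. In the synthetic example below (the shifts are fabricated for illustration), every perturbation-specific shift contains a shared stress-response component, so cosine similarities to the average effect cluster near one — the signature of strong systematic variation.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Synthetic per-perturbation shifts (perturbation centroid minus control
# centroid): a shared component dominates each shift.
rng = np.random.default_rng(2)
shared = rng.normal(size=50)  # systematic component
shifts = np.stack([shared + 0.3 * rng.normal(size=50) for _ in range(20)])
avg_effect = shifts.mean(axis=0)

sims = np.array([cosine(s, avg_effect) for s in shifts])
# A high median cosine similarity means responses point in one shared
# direction, i.e. systematic rather than perturbation-specific variation.
```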

Experimental Benchmarking: Systema vs. Traditional Evaluation

Benchmark Design and Methodology

The Systema benchmark was conducted across ten single-cell perturbation datasets collected from six different sources, spanning three distinct technologies and five different cell lines [15]. The datasets included varying numbers of perturbations, including a genome-wide perturbation screen and a dataset with combinatorial two-gene perturbations [15]. This diversity ensured comprehensive evaluation across different experimental conditions and perturbation types.

The benchmark compared established perturbation response methods against simple non-parametric baselines. The state-of-the-art methods included:

  • Compositional Perturbation Autoencoder (CPA): A baseline operating without prior knowledge [15]
  • GEARS: A method specifically designed for predicting transcriptional outcomes of unseen genetic perturbations [15]
  • scGPT: A foundation model for single-cell multi-omics using generative AI [15]

The simple baselines designed for comparison were:

  • Perturbed Mean: Average expression across all perturbed cells [15] [42]
  • Matching Mean: For combinatorial perturbations X+Y, the average of the X and Y centroids; if X or Y are unseen at training time, their centroid is replaced by the perturbed mean [15] [42]

Table 1: Performance Comparison on Unseen One-Gene Perturbations

| Method | PearsonΔ (Adamson) | PearsonΔ (Norman) | PearsonΔ (Replogle) | PearsonΔ20 (Frangieh) |
| --- | --- | --- | --- | --- |
| Perturbed Mean | Highest | Highest | Highest | Comparable |
| GEARS | Lower | Lower | Lower | Comparable |
| scGPT | Lower | Lower | Lower | Highest |
| CPA | Lower | Lower | Lower | Lower |

Table 2: Performance on Unseen Two-Gene Perturbations (Norman Dataset)

| Method | PearsonΔ | Relative Improvement over Best Alternative |
| --- | --- | --- |
| Matching Mean | Highest | 11% improvement over GEARS |
| GEARS | Lower | Baseline |
| scGPT | Lower | - |
| CPA | Lower | - |

Key Findings from Comparative Analysis

The benchmark results revealed several critical insights that challenged conventional understanding in the field:

  • Simple baselines performed comparably or superior to state-of-the-art methods across different datasets and evaluation metrics [15] [42]. For unseen one-gene perturbations, the perturbed mean baseline outperformed other methods across all datasets using the PearsonΔ score [15].

  • For combinatorial perturbations, the matching mean baseline outperformed all other methods by a considerable margin, with relative improvements of 11% for PearsonΔ over the best alternative method (GEARS) [15].

  • The predicted differential expression profiles across all methods were similar to each other and correlated with those of the perturbed mean, suggesting that perturbation response prediction methods predominantly capture systematic differences rather than perturbation-specific effects [15].

Further investigation revealed specific sources of systematic variation that explained the strong performance of simple baselines:

  • In the Adamson dataset, systematic differences were observed in activity scores between perturbed and control cells for multiple pathways, including response to external stimuli, response to chemical stress, and positive regulation of cell death [15].
  • In the Norman dataset, researchers observed positive activation of cellular death and downregulation of stress response pathways in perturbed cells, including cellular response to heat and to unfolded protein [15].
  • In the Replogle RPE1 dataset, a significant difference was found in the distribution of cells across cell-cycle phases between perturbed and control cells, with 46% of perturbed cells versus 25% of control cells in the G1 phase [15]. This was attributed to widespread chromosomal instability induced by perturbations [15].

[Flowchart: systematic variation arises from selection biases (manifesting as pathway enrichment), biological confounders (cell-cycle effects), and experimental design factors (stress responses); all three inflate model performance estimates and obscure true generalization.]

Systematic Variation Sources: This diagram illustrates how different factors contribute to systematic variation in perturbation datasets, ultimately leading to overestimated model performance.

Systema's Evaluation Methodology and Metrics

Core Evaluation Framework

Systema introduces a fundamentally different approach to evaluation compared to traditional metrics. The key methodological innovation is the replacement of control cells as the reference point with more appropriate references that better isolate perturbation-specific effects [42]. This approach includes:

  • Custom references: Instead of using control cells as the sole point of reference, Systema enables using alternative references, including the centroid of perturbed cells [42].
  • Redefined metrics: Standard evaluation metrics are redefined using the perturbed centroid as reference rather than control cells [42].
  • Centroid accuracy: An intuitive evaluation metric that measures whether predicted post-perturbation profiles are closer to their correct ground-truth centroid than to the centroids of other perturbations [42].

Application of Systema with these modified references resulted in substantially lower evaluation scores across all methods, demonstrating that generalizing to unseen genetic perturbations is substantially more challenging than traditional metrics suggest [42].

Centroid Accuracy for Biological Utility Assessment

A key innovation in Systema is the centroid accuracy metric, which provides a more biologically meaningful assessment of prediction quality [42]. A centroid accuracy of 1 indicates that inferred profiles perfectly recover the expected transcriptional effects of a perturbation [42]. When this metric was applied to evaluate predictions on unseen one-gene perturbations across ten datasets, the average perturbation scores barely exceeded those of the perturbed mean baseline [42].
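The centroid accuracy idea reduces to a short nearest-centroid check. The sketch below uses synthetic centroids (all values hypothetical); note how a "perturbed mean"-style prediction, which outputs the same profile for every perturbation, collapses to near-chance accuracy.

```python
import numpy as np

def centroid_accuracy(pred_centroids, true_centroids):
    """Fraction of predicted perturbation centroids that lie closer to their
    own ground-truth centroid than to any other perturbation's centroid."""
    hits = 0
    for i, p in enumerate(pred_centroids):
        dists = np.linalg.norm(true_centroids - p, axis=1)
        hits += int(np.argmin(dists) == i)
    return hits / len(pred_centroids)

rng = np.random.default_rng(3)
truth = rng.normal(size=(10, 30))               # 10 perturbation centroids
good = truth + 0.1 * rng.normal(size=(10, 30))  # small prediction error
flat = np.tile(truth.mean(axis=0), (10, 1))     # 'perturbed mean' style

acc_good = centroid_accuracy(good, truth)  # perfect recovery
acc_flat = centroid_accuracy(flat, truth)  # near chance: one lucky match
```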

To further evaluate biological utility, the centroid accuracy was extended to test whether predicted centroids could distinguish coarse-grained perturbation effects [42]. In one application, researchers used inferred centroids to classify unseen perturbations as inducing either low or high chromosomal instability (CIN) in the genome-wide K562 perturbation screen [42]. Among all methods, only the finetuned version of scGPT achieved a ROC-AUC substantially above chance (AUC=0.7) [42].

[Flowchart: predicted post-perturbation profiles are compared, by distance, against the correct ground-truth centroid and the centroids of other perturbations; profiles closer to the correct centroid score higher, yielding a biologically grounded accuracy measure.]

Centroid Accuracy Evaluation: This diagram illustrates Systema's centroid accuracy metric, which measures whether predicted profiles are closer to their correct ground-truth centroid than to other perturbation centroids.

Research Reagent Solutions: Essential Materials for Perturbation Studies

Table 3: Key Research Reagents and Computational Tools for Perturbation Studies

| Resource Name | Type | Function/Application | Reference |
| --- | --- | --- | --- |
| Adamson dataset | Experimental data | Investigates endoplasmic reticulum homeostasis through targeted perturbations | [15] |
| Norman dataset | Experimental data | Examines cell cycle and growth processes via combinatorial perturbations | [15] |
| Replogle dataset | Experimental data | Genome-wide perturbation screen in RPE1 and K562 cell lines | [15] [42] |
| Frangieh dataset | Experimental data | Multi-modal pooled Perturb-CITE-seq screens in patient models | [42] |
| GEARS codebase | Computational tool | Data processing and model implementation framework | [42] |
| scGPT | Computational model | Foundation model for single-cell multi-omics using generative AI | [15] |
| CPA | Computational model | Compositional Perturbation Autoencoder for perturbation modeling | [15] |
| Systema GitHub | Evaluation framework | Implementation of the Systema evaluation framework | [15] |

Implications for the Field and Future Directions

The introduction of Systema has profound implications for perturbation effect prediction research, particularly in the evaluation of single-cell foundation models (scFMs). By revealing that current methods struggle to generalize beyond systematic variation, Systema challenges the field to develop more robust approaches that capture genuine biological effects rather than leveraging dataset-specific biases [41] [15].

Looking forward, the developers of Systema suggest that perturbation response models should be evaluated based on their biological utility—how inferred perturbation profiles help answer downstream queries about relevant cellular phenotypes [42]. Framing evaluation in terms of downstream tasks may offer a more meaningful and practical perspective than traditional metrics [42]. Emerging perturbation platforms like optical pooled screens and spatial functional genomics screens, which combine perturbation data with cell morphology, spatial context, and tissue-level features, present particularly rich opportunities for this type of evaluation [42].

For researchers and drug development professionals, Systema offers a more rigorous framework for validating computational models before their application in therapeutic discovery pipelines. By ensuring that models capture genuine perturbation-specific effects rather than systematic biases, Systema can help increase confidence in computational predictions and accelerate the translation of perturbation insights into clinical applications.

Systema represents a paradigm shift in how the field evaluates genetic perturbation response prediction methods. By moving beyond traditional metrics that are susceptible to systematic variation and introducing novel evaluation approaches that emphasize perturbation-specific effects, Systema provides a more rigorous and biologically meaningful standard for assessment. The framework's demonstration that simple baselines often outperform complex models on traditional metrics underscores the critical importance of proper evaluation methodology in advancing the field. As perturbation modeling continues to play an increasingly important role in functional genomics and therapeutic discovery, Systema offers an essential toolkit for ensuring that computational methods genuinely advance our understanding of biological systems rather than merely capturing dataset-specific biases.

In the rapidly evolving field of single-cell biology, the development of single-cell foundation models (scFMs) represents a transformative advance with profound implications for understanding cellular processes and disease mechanisms. These models, trained on millions of single-cell transcriptomes, promise to learn fundamental biological principles that generalize across diverse cell types, states, and conditions [14]. Within this context, perturbation effect prediction—the ability to accurately forecast how genetic manipulations alter cellular states—has emerged as a critical benchmark for scFM capability. However, recent rigorous evaluations reveal a sobering reality: despite significant computational investment, current scFMs frequently fail to outperform deliberately simple baselines on this crucial task [2] [3]. This performance gap highlights the urgent need to systematically evaluate the core optimization levers that govern scFM efficacy: data quality, pretraining strategies, and targeted fine-tuning methodologies.

The benchmarking evidence is striking. Multiple independent studies have demonstrated that scFM embeddings provide no consistent improvement over simpler approaches, particularly when predicting strong or atypical perturbation effects or operating under distribution shift [2] [3] [25]. These findings underscore fundamental limitations in current optimization approaches and necessitate a critical examination of how data quality, pretraining architectures, and fine-tuning protocols can be leveraged to enhance model performance. This review synthesizes current benchmarking results, analyzes the experimental methodologies underlying these findings, and identifies the most promising optimization pathways for advancing perturbation prediction capabilities in scFMs.

Performance Comparison: scFMs vs. Baselines in Perturbation Prediction

Recent systematic benchmarking efforts have yielded consistent findings across multiple studies and model architectures, revealing significant performance gaps in perturbation prediction tasks. The following comparative analysis synthesizes quantitative results from these evaluations.

Table 1: Performance Comparison in Double Perturbation Prediction (Norman et al. Dataset)

| Model Type | Model Name | Prediction Error (L2 Distance) | Genetic Interaction Detection | Key Limitations |
| --- | --- | --- | --- | --- |
| Simple Baselines | Additive Model | Lowest | Cannot predict interactions | Serves as performance reference |
| Simple Baselines | No Change Model | Moderate | Poor performance | Predicts no change from control |
| Single-cell Foundation Models | scGPT | Higher than baseline | No improvement over "no change" | Predictions show minimal variation across perturbations |
| Single-cell Foundation Models | scFoundation | Higher than baseline | Limited capability | Requires specific gene sets; less flexible |
| Single-cell Foundation Models | Geneformer | Higher than baseline | Limited capability | Struggles with strong effect predictions |
| Single-cell Foundation Models | scBERT | Higher than baseline | No improvement over "no change" | Predictions show minimal variation |
| Single-cell Foundation Models | UCE | Higher than baseline | No improvement over "no change" | Predictions show minimal variation |
| Other Deep Learning Models | GEARS | Higher than baseline | Mostly predicts buffering interactions | Limited variation in predictions |
| Other Deep Learning Models | CPA | Highest | Not designed for unseen perturbations | Uncompetitive in this benchmark |

Table 2: Performance in Unseen Perturbation Prediction (Replogle et al. Datasets)

| Model/Approach | Prediction Accuracy | Data Efficiency | Computational Cost |
| --- | --- | --- | --- |
| Mean Prediction Baseline | Moderate | High | Low |
| Linear Model with Training Data Embeddings | Competitive with scFMs | High | Low |
| scGPT with In-built Decoder | Lower than or equal to baselines | Low | High |
| GEARS with In-built Decoder | Lower than or equal to baselines | Low | High |
| Linear Model with scGPT Embeddings | Similar to scGPT itself | Moderate | Moderate |
| Linear Model with Perturbation Pretraining | Highest | High | Moderate |

The consistent pattern across these benchmarks indicates that current scFMs, despite their architectural complexity and extensive pretraining, fail to demonstrate superior performance in perturbation prediction compared to simpler, more direct approaches [2]. This performance gap is particularly evident in challenging scenarios such as predicting genetic interactions or extrapolating to unseen perturbations. Notably, the "additive model," which simply sums the individual logarithmic fold changes of single perturbations to predict double perturbation effects, sets a surprisingly high performance bar that current deep learning models have not consistently surpassed [2].

The limitations extend beyond quantitative metrics to qualitative shortcomings. Most models predominantly predict "buffering" interactions (where the double perturbation effect is less than expected) and rarely correctly identify "synergistic" interactions (where the combined effect is greater than expected) [2]. Furthermore, model predictions often show insufficient variation across different perturbations, suggesting a failure to capture the specific biological consequences of distinct genetic manipulations [2].
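The additive baseline described above is easy to state concretely: predict the double-perturbation log fold change (LFC) as the sum of the two single-perturbation LFCs, and treat deviations from that expectation as candidate genetic interactions. A minimal numpy sketch with invented, purely illustrative values:

```python
import numpy as np

def additive_prediction(lfc_a, lfc_b):
    # Additive baseline: sum the two single-perturbation log fold changes.
    return lfc_a + lfc_b

# Illustrative (invented) log fold changes over three read-out genes.
lfc_a = np.array([1.0, -0.5, 0.0])       # perturbing gene A alone
lfc_b = np.array([0.5, 0.0, -1.0])       # perturbing gene B alone
obs_ab = np.array([1.0, -0.5, -1.0])     # hypothetical observed double LFC

expected_ab = additive_prediction(lfc_a, lfc_b)
deviation = obs_ab - expected_ab         # nonzero entries flag candidate interactions
```

Here the first gene's observed response falls short of the additive expectation, the pattern the benchmark calls a buffering interaction.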

Experimental Protocols and Methodologies

The benchmarking studies employed rigorous experimental designs to ensure fair and informative comparisons between scFMs and baseline approaches. Understanding these methodologies is crucial for interpreting the results and designing future optimization strategies.

Double Perturbation Benchmarking Protocol

The evaluation of double perturbation prediction followed a standardized protocol using data from Norman et al., which measured transcriptional responses to 100 individual gene perturbations and 124 paired gene perturbations in K562 cells using CRISPR activation technology [2]. The experimental workflow included:

  • Data Partitioning: Models were fine-tuned on all 100 single perturbations and a randomly selected subset of 62 double perturbations (50%), with the remaining 62 double perturbations held out for testing. This process was repeated across five different random partitions to ensure robustness.

  • Evaluation Metrics: Primary evaluation employed the L2 distance between predicted and observed expression values for the 1,000 most highly expressed genes. Additional validation used the Pearson delta measure and L2 distances computed on other gene subsets (the most highly expressed or most differentially expressed genes).

  • Genetic Interaction Analysis: Genetic interactions were operationally defined as double perturbation phenotypes that differed from additive expectations beyond what would be expected under a normal distribution null model. At a 5% false discovery rate, 5,035 significant genetic interactions were identified from the complete dataset.

  • Interaction Classification: Predictions were categorized into interaction types: "buffering" (weaker than additive effect), "synergistic" (stronger than additive effect), or "opposite" (qualitatively different effect).
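The interaction-calling step can be illustrated with a simplified per-gene-pair z-test against the additive expectation, followed by Benjamini-Hochberg control at a 5% FDR. The published null model is more involved; the `se` parameter and all toy values below are assumptions for the sketch:

```python
import math
import numpy as np

def interaction_pvalues(obs_ab, lfc_a, lfc_b, se=1.0):
    # Two-sided p-values for deviation from the additive expectation under a
    # per-pair normal null with standard error `se` (a simplification).
    z = (obs_ab - (lfc_a + lfc_b)) / se
    upper_tail = lambda v: 0.5 * (1 - math.erf(abs(v) / math.sqrt(2)))
    return np.array([2 * upper_tail(v) for v in z])

def benjamini_hochberg(pvals, fdr=0.05):
    # Boolean mask of discoveries at the given false discovery rate.
    n = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= fdr * np.arange(1, n + 1) / n
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    mask = np.zeros(n, dtype=bool)
    mask[order[:k]] = True
    return mask

# Toy panel of 50 gene pairs: 49 exactly additive, one strongly deviating.
lfc_a, lfc_b, obs_ab = np.zeros(50), np.zeros(50), np.zeros(50)
obs_ab[7] = 8.0                          # strong non-additive deviation

hits = benjamini_hochberg(interaction_pvalues(obs_ab, lfc_a, lfc_b))
```

Only the strongly deviating pair survives the FDR threshold, mirroring how significant genetic interactions are separated from additive responses.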

[Workflow diagram: load Norman et al. dataset (100 singles, 124 pairs) → split data 50% train / 50% test → fine-tune models on the training partition → predict held-out double perturbations → calculate evaluation metrics (L2 distance, Pearson delta) → classify genetic interaction types → statistical analysis across five random splits.]

Unseen Perturbation Evaluation Protocol

The assessment of model capability to generalize to completely novel perturbations employed data from Replogle et al. (K562 and RPE1 cell lines) and Adamson et al. (K562 cells) [2]. The methodology included:

  • Baseline Construction: A simple linear model framework was developed in which read-out genes are represented by K-dimensional vectors and perturbations by L-dimensional vectors, with the model solving argmin_W ||Y_train - (G W P^T + b)||₂², where Y_train contains the gene expression values, G is the gene embedding matrix, P is the perturbation embedding matrix, W is the learned K × L weight matrix, and b is the vector of row means [2].

  • Embedding Extraction: Gene embedding matrices were extracted from scFoundation and scGPT, while perturbation embedding matrices were extracted from GEARS to test whether pretrained representations contained valuable biological knowledge.

  • Cross-Cell Line Validation: Models were pretrained on one cell line (e.g., K562) and evaluated on another (e.g., RPE1) to assess generalization capability across biological contexts.
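Under the definitions above, the linear baseline admits a closed-form least-squares solution via pseudoinverses. A self-contained numpy sketch on synthetic, noiseless data (dimensions and embeddings are invented; in the benchmark, rows of G come from scFoundation or scGPT and rows of P from GEARS):

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_perts, K, L = 200, 50, 10, 8

# Stand-ins for pretrained embeddings (values invented for illustration).
G = rng.normal(size=(n_genes, K))        # gene embeddings
P = rng.normal(size=(n_perts, L))        # perturbation embeddings
P -= P.mean(axis=0)                      # center so row means equal b exactly

W_true = rng.normal(size=(K, L))
b = rng.normal(size=(n_genes, 1))
Y_train = G @ W_true @ P.T + b           # synthetic expression matrix

# Closed-form fit of argmin_W ||Y_train - (G W P^T + b)||^2,
# with b taken as the vector of row means of Y_train:
b_hat = Y_train.mean(axis=1, keepdims=True)
W_hat = np.linalg.pinv(G) @ (Y_train - b_hat) @ np.linalg.pinv(P.T)
Y_fit = G @ W_hat @ P.T + b_hat          # reconstructs Y_train on this toy data
```

Only the small K × L matrix W is learned, which is why this baseline is cheap to train yet, per the benchmarks, hard for full scFMs to beat.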

Closed-Loop Fine-Tuning Methodology

A more recent approach introduced "closed-loop" fine-tuning that incorporates experimental perturbation data directly into the model refinement process [26]. This methodology includes:

  • Model Fine-Tuning: The base scFM (Geneformer-30M-12L) is first fine-tuned to classify cellular states (e.g., activated vs. resting T-cells) using relevant single-cell RNA sequencing data.

  • Perturbation Data Integration: The model is further fine-tuned with single-cell RNA sequencing data from CRISPR activation/interference screens (Perturb-seq), labeled with cellular activation status but not specific perturbation identities.

  • Iterative Refinement: Model performance is evaluated with incrementally increasing perturbation examples to determine the minimal data required for substantial improvement.

  • Therapeutic Application: The optimized model is applied to disease contexts (e.g., RUNX1-familial platelet disorder) to identify potential therapeutic targets through in silico perturbation screening.
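The in silico screening step at the end of this loop can be caricatured with a toy linear cell-state scorer. This is emphatically not Geneformer's actual embedding-space procedure; the scorer, the `delta` shift, and all values are invented solely to show the shape of the screening loop:

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes = 100

# Stand-in for a fine-tuned cell-state classifier: a linear scorer where
# score > 0 ~ "activated" and score < 0 ~ "resting" (weights invented).
w = rng.normal(size=n_genes)
resting_cell = rng.normal(size=n_genes)

def in_silico_perturbation(cell, gene_idx, mode="activate", delta=2.0):
    # Simulate CRISPRa/CRISPRi by shifting one gene's (log-scale) expression.
    perturbed = cell.copy()
    perturbed[gene_idx] += delta if mode == "activate" else -delta
    return perturbed

# Screen every gene: rank activations by how far they shift the state score.
baseline = w @ resting_cell
shifts = np.array([w @ in_silico_perturbation(resting_cell, g) - baseline
                   for g in range(n_genes)])
top_hits = np.argsort(shifts)[::-1][:5]  # candidate targets for validation
```

The ranked `top_hits` play the role of the prioritized perturbations that would then be validated experimentally (e.g., by flow cytometry) and fed back into the next fine-tuning round.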

[Workflow diagram: pre-trained scFM (Geneformer-30M-12L) → fine-tune for cell-state classification (e.g., T-cell activation) → incorporate Perturb-seq data with activation-status labels → in silico perturbation screening across a gene set → experimental validation using flow cytometry → iterative model refinement with additional examples → application to disease modeling (e.g., RUNX1-FPD).]

Critical Analysis of Optimization Levers

The benchmarking results provide critical insights into the relative importance of different optimization approaches for enhancing scFM performance in perturbation prediction.

Data Quality and Composition

The composition and quality of training data emerge as fundamental determinants of scFM performance. Current scFMs are typically pretrained on large-scale single-cell atlases containing tens of millions of cells spanning diverse tissues and conditions [14]. While this approach captures broad biological variability, it appears insufficient for excelling at perturbation prediction. Several key findings highlight this limitation:

  • Perturbation-Specific Data Trumps Scale: Linear models using embeddings pretrained on perturbation data consistently outperformed those using scFM embeddings pretrained on broader single-cell atlas data [2]. This suggests that data relevance may be more important than dataset size for this specific task.

  • Distribution Shift Vulnerability: scFM embeddings show particularly poor performance under distribution shift, indicating that models trained on "normal" cellular states struggle to generalize to strongly perturbed conditions [3].

  • Data Quality Challenges: Single-cell data suffers from batch effects, technical noise, and variable processing steps that can introduce confounding patterns [14]. While some models demonstrate robustness to these artifacts, data quality issues likely contribute to the performance limitations in perturbation prediction.

Pretraining Strategies and Architectural Choices

Current scFMs predominantly employ transformer architectures, adapting either BERT-like encoder designs or GPT-inspired decoder frameworks [14]. However, benchmarking results suggest that architectural sophistication alone does not guarantee performance advantages for perturbation prediction:

  • Tokenization Challenges: Unlike natural language, gene expression data lacks natural sequential ordering, requiring artificial tokenization strategies such as ranking genes by expression level or binning expression values [14]. These arbitrary orderings may obscure biologically meaningful gene-gene relationships crucial for accurate perturbation prediction.

  • Attention Mechanism Limitations: While self-attention layers theoretically enable models to learn complex gene regulatory relationships, current implementations appear insufficient for capturing the intricate biological logic underlying genetic interactions [2].

  • Embedding Utility: The fact that linear models using extracted scFM embeddings perform similarly to the full models suggests that these embeddings may not capture the specialized information needed for perturbation prediction [2].

Targeted Fine-Tuning Methodologies

Fine-tuning strategies represent the most promising optimization lever, with "closed-loop" approaches demonstrating significant performance improvements:

  • Closed-Loop Advantage: Incorporating experimental perturbation data during fine-tuning dramatically improves prediction accuracy, increasing positive predictive value from 3% to 9% in T-cell activation models while also enhancing sensitivity (76% vs. 48%) and specificity (81% vs. 60%) [26].

  • Data Efficiency: Performance improvements saturate with approximately 20 perturbation examples, suggesting that even modest experimental investments can substantially enhance model accuracy [26].

  • Task-Specific Adaptation: Fine-tuning protocols that explicitly incorporate perturbation effects alongside cellular state information enable models to better capture the causal relationships between genetic manipulations and phenotypic outcomes [26].

Table 3: Impact of Closed-Loop Fine-Tuning on Prediction Metrics (T-cell Activation Model)

| Evaluation Metric | Open-Loop ISP | Closed-Loop ISP | Relative Improvement |
| --- | --- | --- | --- |
| Positive Predictive Value (PPV) | 3% | 9% | 3x |
| Negative Predictive Value (NPV) | 98% | 99% | 1% |
| Sensitivity | 48% | 76% | 58% |
| Specificity | 60% | 81% | 35% |
| AUROC | 0.63 | 0.86 | 37% |
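The rate metrics in Table 3 follow from standard confusion-matrix definitions. The counts below are invented purely to reproduce the reported closed-loop rates; the actual screen sizes are not given in the source:

```python
def classification_metrics(tp, fp, tn, fn):
    # Standard confusion-matrix rates for a hit/non-hit screen.
    return {
        "ppv": tp / (tp + fp),            # positive predictive value
        "npv": tn / (tn + fn),            # negative predictive value
        "sensitivity": tp / (tp + fn),    # true-positive rate
        "specificity": tn / (tn + fp),    # true-negative rate
    }

# Hypothetical counts chosen to mirror the closed-loop row of Table 3.
m = classification_metrics(tp=19, fp=190, tn=810, fn=6)
```

With these counts, sensitivity is 19/25 = 76%, specificity 810/1000 = 81%, PPV 19/209 ≈ 9%, and NPV 810/816 ≈ 99%, matching the tabulated closed-loop values.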

Successful implementation of scFM optimization requires specific computational resources, datasets, and analytical tools. The following table details key components of the experimental pipeline for perturbation prediction studies.

Table 4: Essential Research Reagents and Computational Resources

| Resource Type | Specific Examples | Function and Application |
| --- | --- | --- |
| Single-cell Foundation Models | Geneformer-30M-12L, scGPT, scBERT, UCE, scFoundation | Base models for fine-tuning and perturbation prediction |
| Benchmark Datasets | Norman et al. (CRISPRa in K562), Replogle et al. (CRISPRi in K562/RPE1), Adamson et al. (K562 perturbations) | Standardized data for model training and evaluation |
| Evaluation Frameworks | PertEval-scFM | Standardized benchmarking of perturbation prediction performance |
| Computational Libraries | TensorFlow, PyTorch, Optuna, Ray Tune | Model implementation, training, and hyperparameter optimization |
| Specialized Toolkits | Intel OpenVINO, ONNX Runtime | Model optimization and acceleration for efficient inference |
| Biological Databases | CZ CELLxGENE, Human Cell Atlas, PanglaoDB, GEO/SRA | Sources of diverse single-cell data for pretraining and fine-tuning |

The comprehensive benchmarking of single-cell foundation models for perturbation effect prediction reveals a critical performance gap between current model capabilities and practical biological applications. Despite their architectural sophistication and extensive pretraining, scFMs consistently fail to outperform simpler, more direct approaches for predicting transcriptional responses to genetic perturbations. This limitation underscores the need for more targeted optimization strategies that prioritize data quality, biological relevance, and specialized fine-tuning over sheer model scale and diversity of pretraining data.

The most promising path forward centers on "closed-loop" optimization frameworks that iteratively integrate experimental perturbation data into model refinement. This approach, which increases positive predictive value three-fold with relatively modest data requirements, demonstrates the power of combining foundational pretraining with targeted, task-specific fine-tuning. Future advances will likely depend on developing specialized architectures specifically designed for capturing causal relationships in biological systems, coupled with higher-quality perturbation datasets that more comprehensively probe genetic interactions across diverse cellular contexts.

For researchers and drug development professionals, these findings suggest a pragmatic approach to leveraging scFMs for perturbation prediction. Currently, simpler baseline models provide competitive performance with significantly lower computational costs. However, as optimization methodologies mature—particularly through improved data quality, specialized pretraining, and targeted fine-tuning—scFMs hold immense potential to eventually deliver on their promise as virtual cells for in silico therapeutic discovery and biological mechanism elucidation.

The Benchmarking Landscape: scFMs vs. Baselines Across Diverse Datasets

The ability to accurately predict transcriptional outcomes of genetic perturbations is a central challenge in functional genomics, with profound implications for understanding gene function, mapping regulatory networks, and accelerating therapeutic discovery [15]. Single-cell RNA sequencing technologies, particularly high-throughput perturbation screens like Perturb-seq, have generated vast amounts of data on cellular responses to genetic interventions [43]. In response, numerous computational methods—especially single-cell foundation models (scFMs) and other deep learning approaches—have been developed to predict effects of both single-gene and combinatorial perturbations, with the ultimate goal of generalizing to entirely unseen genetic interventions [15] [2].

This comparison guide provides an objective performance evaluation of state-of-the-art perturbation prediction methods, with particular focus on their capabilities for predicting double-gene perturbation effects and generalizing to unseen perturbations. We synthesize recent benchmarking studies and experimental validations to offer researchers, scientists, and drug development professionals a clear assessment of the current landscape, methodological considerations, and practical performance expectations for these tools.

Performance Comparison Tables

Double Perturbation Prediction Performance

Table 1: Performance comparison of methods predicting double-gene perturbation effects on the Norman et al. (2019) K562 CRISPRa dataset. Prediction error is measured as L2 distance between predicted and observed expression values for the top 1,000 highly expressed genes [2].

| Method | Type | Prediction Error (L2) | Key Characteristics |
| --- | --- | --- | --- |
| Additive Model | Simple Baseline | Lowest | Sum of individual logarithmic fold changes [2] |
| No Change Model | Simple Baseline | Intermediate | Predicts same expression as control condition [2] |
| GEARS | Deep Learning | Higher than baselines | Integrates knowledge graphs [2] |
| scGPT | Foundation Model | Higher than baselines | Pretrained on single-cell data [2] |
| CPA | Deep Learning | Highest | Not designed for unseen perturbations [2] |

Unseen Perturbation Prediction Performance

Table 2: Performance comparison on predicting effects of completely unseen single-gene perturbations across multiple datasets (Adamson et al., Replogle et al.). Performance metrics include Pearson correlation between predicted and actual expression changes [15] [2].

| Method | Adamson Dataset | Replogle K562 | Replogle RPE1 | Key Characteristics |
| --- | --- | --- | --- | --- |
| Perturbed Mean | Best | Best | Best | Average expression across all perturbed cells [15] |
| Linear Model with Pretrained P | Competitive | Competitive | Competitive | Embeddings pretrained on perturbation data [2] |
| scGPT | Intermediate | Intermediate | Intermediate | Foundation model with biological pretraining [2] |
| GEARS | Intermediate | Intermediate | Intermediate | Uses Gene Ontology annotations [2] |
| Matching Mean | Not applicable | Not applicable | Not applicable | For combinatorial perturbations only [15] |

Experimental Protocols and Methodologies

Double Perturbation Benchmarking Protocol

The standard evaluation protocol for double perturbation prediction utilizes the Norman et al. (2019) dataset, which contains CRISPR activation (CRISPRa) perturbations of 100 individual genes and 124 gene pairs in K562 cells, with single-cell RNA sequencing measurements of 19,264 genes [2]. The established methodology involves:

  • Data Partitioning: Models are fine-tuned on all 100 single perturbations and a subset of 62 double perturbations, with the remaining 62 double perturbations held out for testing. For robustness, analyses are typically repeated across five random partitions [2].

  • Evaluation Metrics: Multiple metrics are employed including:

    • L2 distance between predicted and observed expression values
    • Pearson correlation between ground truth and predicted expression changes (PearsonΔ)
    • Pearson correlation focused on top 20 differentially expressed genes (PearsonΔ20)
    • Root mean-squared error (RMSE) [15] [2]
  • Genetic Interaction Analysis: Methods are evaluated on their ability to predict genetic interactions, defined as double perturbation phenotypes that significantly deviate from additive expectations. Performance is measured using true-positive rate and false discovery proportion curves across threshold values [2].
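The correlation-based metrics listed above can be sketched directly. The top-20 selection rule below is one plausible reading of PearsonΔ20 (genes with the largest observed absolute change), not a confirmed implementation detail, and the data are synthetic:

```python
import numpy as np

def pearson_delta(pred, obs, ctrl, top_k=None):
    # Correlate predicted vs. observed expression *changes* relative to
    # control; top_k restricts to the genes with the largest observed
    # absolute change (an assumed reading of PearsonΔ20).
    dp, do = pred - ctrl, obs - ctrl
    if top_k is not None:
        idx = np.argsort(np.abs(do))[-top_k:]
        dp, do = dp[idx], do[idx]
    return np.corrcoef(dp, do)[0, 1]

def rmse(pred, obs):
    # Root mean-squared error over all read-out genes.
    return np.sqrt(np.mean((pred - obs) ** 2))

rng = np.random.default_rng(3)
ctrl = rng.normal(size=2000)                    # synthetic control profile
obs = ctrl + rng.normal(scale=0.5, size=2000)   # synthetic perturbed profile
pred = ctrl + 0.8 * (obs - ctrl)                # shrunken but well-correlated

delta_all = pearson_delta(pred, obs, ctrl)      # ~1.0: deltas are proportional
delta_20 = pearson_delta(pred, obs, ctrl, top_k=20)
err = rmse(pred, obs)
```

Note the contrast this exposes: a prediction that systematically underestimates effect sizes still attains a near-perfect PearsonΔ, which is one reason distance metrics such as L2 and RMSE are reported alongside it.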

Unseen Perturbation Evaluation Framework

For assessing generalization to entirely unseen perturbations, the Systema framework has been developed to address limitations of standard evaluation metrics [15]. Key methodological components include:

  • Dataset Selection: Evaluation across ten single-cell perturbation datasets spanning three technologies (CRISPRi, CRISPRa, Perturb-seq) and five cell lines (K562, RPE1, etc.), including genome-wide screens [15].

  • Systematic Variation Control: Quantification and adjustment for systematic differences between perturbed and control cells caused by selection biases or biological confounders, which can artificially inflate performance metrics [15].

  • Perturbation-Specific Focus: Emphasis on evaluating the prediction of perturbation-specific effects rather than systematic variation patterns through:

    • Analysis of model performance on heterogeneous gene panels
    • Assessment of biological coherence in predicted perturbation effects
    • Evaluation of perturbation landscape reconstruction capability [15]

Signaling Pathways and Experimental Workflows

[Workflow diagram: data preparation and partitioning → model training and fine-tuning → perturbation benchmarking along two parallel tracks (double perturbation prediction; unseen perturbation prediction) → performance analysis → results and validation. Key evaluation dimensions: double-perturbation accuracy, unseen-perturbation generalization, genetic interaction detection, and robustness to systematic variation.]

Perturbation Prediction Evaluation Workflow - This diagram illustrates the comprehensive benchmarking methodology for evaluating perturbation prediction methods, highlighting the parallel assessment of double perturbation prediction and unseen perturbation generalization capabilities.

[Diagram: systematic variation in data arises from selection bias in the perturbation panel and from biological confounders (cell cycle, stress response), producing consistent differences between control and perturbed cells (e.g., ER homeostasis genes in the Adamson dataset, cell cycle genes in the Norman dataset, cell cycle arrest in Replogle RPE1 with 46% vs. 25% of cells in G1). These differences inflate performance metrics, explain the high performance of simple baselines, and lead to overestimation of true generalization capability.]

Systematic Variation Impact on Evaluation - This diagram outlines how systematic variation in perturbation datasets affects performance evaluation, explaining why simple baselines can outperform complex models and highlighting specific examples from benchmark datasets.

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential research reagents and computational resources for perturbation prediction studies.

| Resource | Type | Function | Example Applications |
| --- | --- | --- | --- |
| CRISPR Activation (CRISPRa) | Molecular Tool | Targeted gene overexpression using catalytically dead Cas9 fused to transcriptional activators [43] | Norman et al. K562 perturbation screen [2] |
| CRISPR Interference (CRISPRi) | Molecular Tool | Targeted gene repression using catalytically dead Cas9 fused to repressive domains [43] | Replogle et al. genome-wide screens [2] |
| Perturb-seq | Experimental Method | Combined CRISPR perturbation with single-cell RNA sequencing to measure transcriptomic effects [44] [43] | High-throughput perturbation screening [44] |
| Gene Ontology (GO) Annotations | Knowledge Base | Structured biological knowledge for gene function prediction and model generalization [2] | GEARS model extrapolation [2] |
| Protein-Protein Interaction Networks | Knowledge Graph | Prior biological network information for graph-based models [44] | GEARS and GraphReach implementations [44] |

Key Findings and Interpretation

Performance Insights

The consistent underperformance of complex deep learning models relative to simple baselines across multiple benchmarks reveals several critical insights about the current state of perturbation prediction:

  • Systematic Variation Dominance: Simple baselines like the perturbed mean (the average expression across all perturbed cells) perform comparably to, or better than, state-of-the-art methods because they effectively capture systematic differences between control and perturbed cells, which often dominate the signal in perturbation datasets [15].

  • Evaluation Metric Limitations: Standard evaluation metrics are highly susceptible to systematic variation, leading to overestimated performance for methods that primarily capture these consistent patterns rather than perturbation-specific effects [15].

  • Generalization Challenges: Models struggle to generalize beyond the systematic variation present in training data, with true zero-shot prediction of entirely novel perturbation effects remaining particularly challenging [15] [2].
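Why the perturbed-mean baseline is hard to beat can be seen on synthetic data in which a shared systematic shift dominates the perturbation-specific signal (all parameters below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n_perts, n_genes = 30, 500

# Synthetic dataset: a "systematic" shift common to all perturbations plus
# a smaller perturbation-specific component.
ctrl = rng.normal(size=n_genes)
systematic = rng.normal(scale=1.0, size=n_genes)
effects = systematic + rng.normal(scale=0.2, size=(n_perts, n_genes))
profiles = ctrl + effects                  # mean profile per perturbation

train, test = profiles[:-1], profiles[-1]  # hold out one "unseen" perturbation

# Perturbed-mean baseline: predict the average of all training perturbations.
perturbed_mean = train.mean(axis=0)

err_baseline = np.linalg.norm(perturbed_mean - test)
err_control = np.linalg.norm(ctrl - test)  # "no change" prediction
```

Because the systematic component is shared, the perturbed mean lands close to every held-out profile while the control profile does not, inflating the apparent accuracy of a model that has learned nothing perturbation-specific.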

Methodological Considerations

For researchers applying these methods in practical settings, several methodological considerations emerge from the benchmarking results:

  • Baseline Implementation: Always include simple baselines (perturbed mean, additive model) as reference points when evaluating new perturbation prediction methods [15] [2].

  • Dataset Awareness: Understand the specific systematic variations in benchmark datasets, such as the cell cycle focus in Norman et al. or ER homeostasis focus in Adamson et al., as these strongly influence performance outcomes [15].

  • Evaluation Strategy: Employ comprehensive evaluation frameworks like Systema that specifically address systematic variation and focus on perturbation-specific effects rather than relying solely on standard correlation metrics [15].

  • Architecture Selection: Consider simpler architectures or hybrid approaches, as current evidence suggests that model complexity does not necessarily translate to improved perturbation prediction capability [2].

Future Directions

Recent methodological advances suggest promising directions for improving perturbation prediction:

  • Closed-Loop Frameworks: Incorporating experimental perturbation data during model fine-tuning has shown potential for significant performance improvements, with one study demonstrating a three-fold increase in positive predictive value compared to standard approaches [26].

  • Advanced Architectures: Models like PerturbNet, which utilize conditional normalizing flows to map perturbation representations to cell state distributions, show improved performance for predicting effects of completely unseen genes and can handle diverse perturbation types including small molecules and missense mutations [43].

  • Efficient Training Strategies: Approaches like GraphReach that optimize training perturbation selection through graph-based subset selection can substantially accelerate model development while maintaining competitive accuracy [44].

The field continues to evolve rapidly, with ongoing efforts focused on developing more robust evaluation frameworks, incorporating additional biological prior knowledge, and creating models that can genuinely generalize beyond their training data to enable accurate prediction of novel therapeutic interventions.

Single-cell foundation models (scFMs) represent a groundbreaking advancement in computational biology, applying transformer-based architectures to massive single-cell transcriptomics datasets [14]. Trained on millions of cells across diverse tissues and conditions, these models promise to learn universal biological principles that enable prediction of cellular behaviors—including responses to genetic perturbations [6] [45]. The theoretical potential is transformative: in-silico simulation of genetic intervention effects could accelerate therapeutic discovery by prioritizing experiments most likely to yield valuable biological insights [26] [40].

However, comprehensive benchmarking studies reveal a significant gap between this promise and current capabilities, particularly for predicting genetic interactions in combinatorial perturbation scenarios. This guide synthesizes evidence from recent rigorous evaluations to objectively compare scFM performance against simpler alternatives, providing researchers with evidence-based recommendations for model selection in perturbation analysis.

Performance Benchmarking: scFMs vs. Simple Baselines

Double Perturbation Prediction

A critical benchmark assesses model performance in predicting expression changes after dual-gene perturbations, which requires capturing non-additive genetic interactions. As shown in Table 1, multiple scFMs fail to outperform deliberately simple baselines on this complex task.

Table 1: Performance comparison on double perturbation prediction (Norman et al. dataset)

| Model Category | Specific Models | Performance vs. Additive Baseline | Key Limitations |
| --- | --- | --- | --- |
| Single-cell Foundation Models | scGPT, scFoundation, Geneformer, UCE, scBERT | Substantially higher prediction error [2] | Predictions show minimal variation across perturbations [2] |
| Task-Specific Deep Learning | GEARS, CPA | Higher prediction error than baselines [2] | Limited ability to represent genetic interactions [2] |
| Simple Baselines | Additive model (sum of individual LFCs), No-change model | Reference performance [2] | Additive model cannot predict interactions by design [2] |

When predicting genetic interactions specifically—defined as double perturbation phenotypes that significantly deviate from additive expectations—no model outperformed the "no change" baseline. All deep learning models primarily predicted buffering interactions and rarely identified synergistic interactions correctly [2].
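One simple rule for assigning these interaction labels, illustrative rather than the published definition, compares the magnitude and direction of the observed double-perturbation effect against the additive expectation:

```python
import numpy as np

def classify_interaction(obs_ab, lfc_a, lfc_b, tol=0.1):
    # Classify a double-perturbation effect against the additive expectation
    # using overall effect direction and magnitude (a simplified rule).
    expected = lfc_a + lfc_b
    if np.dot(obs_ab, expected) < 0:
        return "opposite"       # qualitatively different direction
    ratio = np.linalg.norm(obs_ab) / np.linalg.norm(expected)
    if ratio < 1 - tol:
        return "buffering"      # weaker than additive
    if ratio > 1 + tol:
        return "synergistic"    # stronger than additive
    return "additive"

# Invented two-gene read-out for illustration.
lfc_a = np.array([1.0, 0.5])
lfc_b = np.array([0.5, 0.5])
label_weak = classify_interaction(np.array([0.7, 0.4]), lfc_a, lfc_b)
label_strong = classify_interaction(np.array([2.5, 1.8]), lfc_a, lfc_b)
```

A model whose predictions always fall inside the additive envelope will, under any such rule, be scored as predominantly "buffering", which is exactly the failure mode the benchmark reports.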

Single Perturbation Prediction Across Contexts

Benchmarking extended to covariate transfer tasks, where models trained on perturbations in one cellular context must predict effects in another context. As shown in Table 2, simple linear approaches remain highly competitive.

Table 2: Performance on unseen perturbation prediction across cell lines

| Model Type | Examples | Performance vs. Linear Baselines | Data Requirements |
| --- | --- | --- | --- |
| Foundation Models with Fine-tuning | scGPT, Geneformer | Do not consistently outperform mean prediction or linear models [2] | Extensive pretraining + task-specific fine-tuning |
| Linear Models | Linear model with task-trained embeddings | Competitive or superior to foundation models [2] | Task-specific training data only |
| Mean Prediction | Simple average | Surprisingly difficult to outperform [2] | None (most basic baseline) |

Notably, incorporating pretrained gene embeddings from scFoundation or scGPT into linear models matched or exceeded the performance of the full foundation models with their native decoders. However, the most effective approach combined linear models with perturbation embeddings pretrained on orthogonal perturbation data [2].

Experimental Protocols in Benchmarking Studies

Double Perturbation Benchmarking Methodology

The benchmark revealing scFMs' struggles with genetic interactions employed rigorous methodology:

  • Data Source: Norman et al. dataset with 100 individual gene and 124 paired gene perturbations using CRISPR activation in K562 cells [2]
  • Training-Test Split: Models fine-tuned on all 100 single perturbations and 62 of the double perturbations [2]
  • Evaluation Set: 62 held-out double perturbations across five random partitions [2]
  • Primary Metric: L2 distance between predicted and observed expression for the 1,000 most highly expressed genes [2]
  • Interaction Analysis: Identification of genetic interactions as phenotypes differing from additive expectations beyond a normal distribution null model [2]
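The additive baseline and the L2-on-top-genes metric used in this protocol can be sketched in a few lines. The following is a minimal numpy illustration with toy data; the function names and values are ours, not the benchmark's actual code:

```python
import numpy as np

def additive_baseline(lfc_a, lfc_b):
    """Additive baseline: predict a double perturbation's log fold
    change as the sum of the two single-perturbation LFCs."""
    return lfc_a + lfc_b

def l2_top_genes(pred_expr, obs_expr, control_expr, k=1000):
    """L2 distance between predicted and observed expression,
    restricted to the k most highly expressed genes (ranked by
    mean control expression, mirroring the benchmark's metric)."""
    top = np.argsort(control_expr)[::-1][:k]
    return float(np.linalg.norm(pred_expr[top] - obs_expr[top]))

# Toy example: 5 genes, keep the top 3 by control expression.
control = np.array([10.0, 5.0, 1.0, 0.5, 0.1])
lfc_a = np.array([1.0, 0.0, -0.5, 0.2, 0.0])
lfc_b = np.array([0.5, 0.3, 0.0, -0.2, 0.1])
pred_double = control + additive_baseline(lfc_a, lfc_b)  # predicted expression
obs_double = control + lfc_a + lfc_b  # here: a perfectly additive double perturbation
print(l2_top_genes(pred_double, obs_double, control, k=3))  # → 0.0
```

By construction, the additive baseline is exact when the double perturbation is purely additive, which is why deviations from it define genetic interactions.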

Covariate Transfer Evaluation Framework

The PerturBench framework established standardized evaluation for cross-context prediction:

  • Task Definition: Predict perturbation effects in biological states (cell types/lines) not seen during training [40]
  • Datasets: Multiple datasets spanning genetic and chemical perturbations across diverse cell types [40]
  • Evaluation Metrics: Combination of traditional metrics (RMSE) and rank-based metrics assessing perturbation ordering [40]
  • Data Splitting: Strict separation ensuring no perturbation condition overlaps between training and test sets [46]
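A rank-based metric of the kind described above can be implemented as a Spearman correlation between predicted and observed per-perturbation effect magnitudes. This is a minimal numpy sketch (tie handling omitted for brevity); the actual PerturBench metric may differ in detail:

```python
import numpy as np

def _ranks(x):
    """Rank values 0..n-1 in ascending order (ties not handled)."""
    r = np.empty(len(x), dtype=float)
    r[np.argsort(x)] = np.arange(len(x), dtype=float)
    return r

def perturbation_ordering_score(pred_effects, obs_effects):
    """Spearman correlation between predicted and observed effect
    magnitudes (L2 norm of each perturbation's expression change),
    assessing whether a model orders perturbations from weakest to
    strongest correctly."""
    pred_mag = np.linalg.norm(pred_effects, axis=1)
    obs_mag = np.linalg.norm(obs_effects, axis=1)
    return float(np.corrcoef(_ranks(pred_mag), _ranks(obs_mag))[0, 1])

rng = np.random.default_rng(0)
obs = rng.normal(size=(8, 50))  # 8 perturbations x 50 genes
print(perturbation_ordering_score(obs, obs))  # perfect ordering
```

Ordering-based metrics of this kind are less sensitive to global scale errors than RMSE, which is why the framework combines both.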

[Diagram: scFM Benchmarking Workflow for Perturbation Prediction — scFMs and simple baselines (additive, linear) are run over diverse perturbation datasets, processed zero-shot or fine-tuned, and evaluated on multiple metrics (L2 distance on top genes, genetic interaction detection, covariate transfer accuracy) against the simple baselines; the key finding is that scFMs struggle with genetic interaction prediction.]

Critical Analysis: Why scFMs Struggle with Genetic Interactions

Technical Limitations

Multiple technical factors contribute to the performance gap in genetic interaction prediction:

  • Mode Collapse: Several scFMs exhibit limited variation in predictions across different perturbations, essentially reverting to baseline behaviors [2] [40]
  • Representation Learning Gaps: Pretraining on atlas data provides minimal benefit over random embeddings for perturbation prediction [2]
  • Architecture-Data Mismatch: Transformer architectures designed for sequential data may not optimally capture the non-sequential, combinatorial nature of genetic interactions [6]

Data Quality and Availability Challenges

Fundamental data issues underlie modeling difficulties:

  • Sparse Effect Problem: Most perturbations produce small transcriptomic effects, making strong or atypical responses particularly challenging to predict [13]
  • Limited Combinatorial Data: Training data rarely encompasses sufficient examples of genetic interactions for effective pattern recognition [2]
  • Technical Noise: High dimensionality and sparsity of single-cell data obscure subtle interaction effects [6] [45]

Pathways to Improvement: Emerging Solutions

Enhanced Training Strategies

Promising approaches address current limitations through improved training methodologies:

  • Closed-Loop Fine-tuning: Incorporating experimental perturbation data during fine-tuning significantly improves prediction accuracy, demonstrating a three-fold increase in positive predictive value in T-cell activation studies [26]
  • Progressive Incorporation: Even modest numbers of perturbation examples (10-20) during fine-tuning yield substantial improvements [26]
  • Multi-task Learning: Simultaneous training on related tasks improves generalizability beyond single-objective optimization [40]

Architectural Innovations

Novel model designs specifically target perturbation prediction challenges:

  • Disentanglement Strategies: Separating basal cellular state from perturbation effects using adversarial classifiers [40]
  • Optimal Transport Methods: Matching control and perturbed cells to predict full response distributions rather than average effects [40]
  • Knowledge Integration: Incorporating prior biological knowledge through gene regulatory networks or ontological relationships [6] [46]

[Diagram: Pathways to Improve scFM Genetic Interaction Prediction — mode collapse (predictions with limited variation) is addressed via the feedback of closed-loop fine-tuning; data sparsity (limited combinatorial examples) is augmented with biological knowledge integration; architecture mismatch (sequential models for non-sequential biology) is addressed by disentanglement strategies; all three pathways converge on improved genetic interaction prediction.]

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key experimental resources for perturbation prediction research

Resource Category | Specific Examples | Function in Research | Key Characteristics
Benchmarking Datasets | Norman et al. (double perturbations), Replogle et al. (CRISPRi/a), PerturBench collection [2] [40] | Standardized evaluation across models | Diverse modalities, combinatorial perturbations, multiple cell types [40]
Evaluation Frameworks | PerturBench, PEREGGRN, PertEval-scFM [40] [46] [13] | Consistent model comparison and metric calculation | Modular design, multiple data splitting strategies, diverse metrics [40] [46]
Baseline Models | Additive model, No-change model, Linear models with embeddings [2] | Critical performance reference points | Simple implementation, established performance floor [2]
Foundation Models | Geneformer, scGPT, scFoundation, UCE, scBERT [6] [2] | Primary test subjects for advanced capability assessment | Large-scale pretraining, transformer architectures, zero-shot capabilities [6] [14]

Current evidence demonstrates that single-cell foundation models have not yet fulfilled their potential for genetic interaction prediction, consistently failing to outperform simpler baseline methods. This performance gap stems from both technical limitations and fundamental biological data challenges.

For researchers pursuing perturbation effect prediction, evidence suggests the following strategic approach:

  • Employ Rigorous Baselines: Always compare scFM performance against simple additive and linear models to validate any performance advantages [2]
  • Prioritize Data Quality: Focus on high-quality, well-replicated perturbation datasets rather than exclusively scaling model complexity [46]
  • Implement Closed-Loop Fine-tuning: Incorporate even small numbers of experimental perturbation examples to substantially boost prediction accuracy [26]
  • Utilize Standardized Benchmarks: Leverage established frameworks like PerturBench and PEREGGRN for biologically meaningful evaluation [40] [46]

The field continues to evolve rapidly, with new architectural innovations and training strategies regularly emerging. However, the consistent failure of current scFMs to outperform simple baselines on genetic interaction prediction underscores that model scale alone is insufficient—future progress must address fundamental limitations in capturing biological causality and combinatorial complexity.

The emergence of single-cell foundation models (scFMs) has generated significant interest in their potential to predict transcriptional responses to genetic perturbations in silico, a capability with profound implications for basic biology and therapeutic development [46] [2]. However, realizing this potential requires robust, standardized evaluation to separate genuine methodological advancement from optimistic claims. This comparative guide examines three significant benchmarking efforts—PEREGGRN, the benchmark from Nature Methods, and PertEval-scFM—that have independently addressed this critical need. These platforms employ distinct methodologies to answer a central question: can complex machine learning models, particularly deep-learning-based scFMs, reliably outperform simple baselines in predicting perturbation effects? This analysis synthesizes their experimental protocols, findings, and resources to provide researchers with a clear understanding of the current benchmarking landscape and its consensus conclusions.

This section details the core design and experimental protocols of the three major benchmarking platforms, highlighting their unique focus areas and methodological approaches.

PEREGGRN: A Modular Grammar for Gene Regulatory Networks

The PEREGGRN (PErturbation Response Evaluation via a Grammar of Gene Regulatory Networks) framework is built around a central software engine called GGRN (Grammar of Gene Regulatory Networks). Its design is inherently modular, focusing on supervised machine learning to forecast each gene's expression based on candidate regulators [46].

  • Key Methodology: GGRN can utilize any of nine different regression methods, including simple mean and median dummy predictors. A critical aspect of its training is that samples where a gene is directly perturbed are omitted when training models to predict that same gene's expression, preventing data leakage and enabling training on interventional data [46].
  • Benchmarking Infrastructure: PEREGGRN provides a curated collection of 11 large-scale perturbation transcriptomics datasets, all human, which have been quality-controlled and uniformly formatted. It also includes a suite of cell type-specific gene networks derived from motif analysis, co-expression, and other approaches [46].
  • Evaluation Core: Its benchmarking software is highly configurable, allowing users to select datasets, data splitting schemes, and performance metrics. The most critical data split is nonstandard: no perturbation condition is allowed to occur in both training and test sets, ensuring evaluation is performed on truly unseen genetic interventions [46].

Nature Methods Benchmark: Scrutinizing Foundation Models

This benchmark study took a direct approach to evaluating several prominent deep-learning-based models, including the foundation models scGPT and scFoundation, against deliberately simple baselines [2].

  • Model Scope: It evaluated five foundation models (scGPT, scFoundation, scBERT, Geneformer, UCE) alongside two other deep learning models (GEARS, CPA) designed for perturbation prediction [2].
  • Simple Baselines: The study employed two straightforward baselines: 1) a "no change" model that always predicts control condition expression, and 2) an "additive" model that for double perturbations predicts the sum of individual logarithmic fold changes [2].
  • Benchmarking Tasks: The evaluation tested two key claims: a) the ability to predict expression changes after double perturbations using data from Norman et al., and b) the ability to predict effects of unseen single perturbations using datasets from Replogle et al. and Adamson et al. [2].
  • Linear Model Probe: A key investigative tool was a simple linear model that used gene and perturbation embedding matrices (G and P), which could be either learned from training data or extracted from the pre-trained foundation models, to test the utility of the learned representations themselves [2].
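The linear probe's bilinear form can be sketched directly: predicted log fold change for perturbation j is G · P[:, j], with G and P either learned from training data or extracted from a pre-trained foundation model. This is a minimal numpy sketch; the exact parameterization in [2] may include additional offset or scaling terms:

```python
import numpy as np

def linear_probe_predict(G, P, pert_idx):
    """Linear perturbation model: predicted per-gene log fold change
    for perturbation pert_idx is a bilinear readout of a gene
    embedding matrix G (genes x d) and a perturbation embedding
    matrix P (d x perturbations)."""
    return G @ P[:, pert_idx]

rng = np.random.default_rng(0)
G = rng.normal(size=(100, 16))  # 100 genes, 16-dim embeddings
P = rng.normal(size=(16, 20))   # 20 perturbations
pred = linear_probe_predict(G, P, pert_idx=3)
print(pred.shape)  # one predicted LFC per gene
```

Swapping in frozen embeddings from scGPT or scFoundation for G is what allowed the study to test the learned representations independently of each model's native decoder.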

PertEval-scFM: A Standardized Zero-Shot Evaluation Framework

PertEval-scFM is a standardized framework specifically designed for the zero-shot evaluation of single-cell foundation model embeddings for perturbation effect prediction [3] [47] [25].

  • Primary Focus: Unlike benchmarks that involve fine-tuning, PertEval-scFM assesses the inherent capability of contextualized representations learned by scFMs during pre-training. It benchmarks these embeddings against simpler baseline models to determine if they provide any superior predictive power out-of-the-box [25].
  • Evaluation Challenges: The framework is designed to test models under conditions of distribution shift and their ability to predict strong or atypical perturbation effects, which are challenging scenarios for any model [25].
  • Unified Interface: While not part of the core PertEval benchmark itself, frameworks like BioLLM have emerged to support such evaluations by providing a unified interface that integrates diverse scFMs, eliminating architectural and coding inconsistencies and enabling streamlined model switching and consistent benchmarking [8].

The following diagram illustrates the core methodological workflow shared by these benchmarking platforms, from data input to performance evaluation.

[Diagram: Shared benchmarking workflow — perturbation transcriptomics datasets are split so that unseen perturbations are held out; model classes (scFMs, GRN methods, simple baselines) generate expression predictions zero-shot or fine-tuned; performance is then evaluated across multiple metrics and scenarios.]

Comparative Performance Analysis

A synthesis of results across all three benchmarking platforms reveals a consistent and striking conclusion: current complex methods, including deep-learning-based foundation models, generally fail to outperform simple baseline approaches.

Key Findings from Major Studies

  • Nature Methods Results: In the double perturbation benchmark, all deep learning models had a prediction error "substantially higher than the additive baseline." When predicting genetic interactions, none of the models performed better than the "no change" baseline [2]. Furthermore, a simple linear model using embeddings from perturbation data outperformed models using embeddings from foundation models pre-trained on single-cell atlas data [2].
  • PEREGGRN Findings: The study found that "it is uncommon for expression forecasting methods to outperform simple baselines," confirming the general trend observed in other benchmarks [46].
  • PertEval-scFM Conclusions: This benchmark found that "scFM embeddings do not provide consistent improvements over baseline models, especially under distribution shift." It also highlighted that all models struggle with predicting strong or atypical perturbation effects [25].

Consolidated Performance Table

The table below summarizes the quantitative findings and conclusions across the three benchmarking platforms.

Table 1: Consolidated Performance Findings Across Benchmarking Platforms

Benchmark Platform | Top-Performing Methods | Key Comparative Finding | Performance Context
PEREGGRN [46] | Simple baselines | "Uncommon" for complex methods to outperform simple baselines | Evaluation across 11 human perturbation datasets
Nature Methods [2] | Additive model, Linear model, "No change" model | "None outperformed the baselines" | Double & unseen single perturbation prediction
PertEval-scFM [25] | Simple baseline models | "No consistent improvements" from scFM embeddings | Zero-shot prediction under distribution shift

A critical aspect of these benchmarks is their rigorous experimental design, which includes carefully curated data resources and specific evaluation protocols to ensure fair and biologically meaningful comparisons.

  • PEREGGRN Data: Provides 11 uniformly formatted human perturbation transcriptomics datasets, focusing on contexts with many perturbed genes. The platform includes extensive quality control, such as removing knockdown or overexpression samples where the targeted transcript did not change as expected [46].
  • Nature Methods Data: Utilized several key datasets: the Norman et al. data (CRISPRa in K562 cells) for double perturbations, and datasets from Replogle et al. (K562 and RPE1 cells) and Adamson et al. (K562 cells) for unseen single perturbation prediction [2].
  • PertEval-scFM Data: While specific datasets are not detailed in the available content, the framework is designed for flexible integration of data to test zero-shot prediction under various conditions, including distribution shift [25].

Core Evaluation Metrics

Each platform employs a suite of metrics to comprehensively evaluate performance, recognizing that no single metric perfectly captures prediction utility.

  • PEREGGRN Metrics: Includes three categories: 1) Common performance metrics (MAE, MSE, Spearman correlation, direction accuracy); 2) Metrics on the top 100 most differentially expressed genes; and 3) Cell type classification accuracy for reprogramming studies [46].
  • Nature Methods Metrics: Used L2 distance between predicted and observed expression for highly expressed or differentially expressed genes, Pearson delta measure, and true-positive/false-discovery rates for genetic interaction prediction [2].
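Of these metrics, direction accuracy is the simplest to state precisely; a minimal sketch of one common definition (the fraction of genes whose predicted expression change has the same sign as the observed change — implementations may differ in how zero changes are treated):

```python
import numpy as np

def direction_accuracy(pred_delta, obs_delta):
    """Fraction of genes for which the predicted expression change
    has the same sign as the observed change."""
    return float(np.mean(np.sign(pred_delta) == np.sign(obs_delta)))

obs = np.array([0.5, -1.2, 0.1, -0.3])
pred = np.array([0.2, -0.4, -0.2, -0.1])
print(direction_accuracy(pred, obs))  # 3 of 4 signs match → 0.75
```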

Table 2: Essential Research Reagents and Resources

Resource Name | Type | Function in Benchmarking | Example Sources/Platforms
Perturbation Datasets | Data | Provide ground-truth transcriptome changes for training & evaluation | Norman et al., Replogle et al., Adamson et al. [2]
Gene Networks | Prior Knowledge | Inform regulatory relationships for model training | Motif-based, co-expression networks [46]
Benchmarking Software | Tool | Standardize evaluation protocols & metrics | PEREGGRN, PertEval-scFM [46] [25]
Unified Model APIs | Tool | Enable consistent model integration & switching | BioLLM framework [8]
Simple Baseline Models | Method | Provide critical performance reference point | "No change", "Additive", Linear models [2]

The consensus across multiple independent, rigorous benchmarks is clear and consistent: despite their theoretical promise and architectural complexity, current deep-learning-based foundation models and specialized expression forecasting methods have not demonstrated superior performance over simple baseline models for predicting genetic perturbation effects. This conclusion holds across various prediction tasks—single and double perturbations, seen and unseen perturbations—and is robust to the choice of evaluation metric [46] [2] [25].

These findings highlight the critical importance of standardized, neutral benchmarking in directing and evaluating methodological development in computational biology. The emergence of platforms like PEREGGRN, PertEval-scFM, and the methodologies in the Nature Methods study provides the community with the tools necessary for rigorous self-assessment. Future progress in the field will depend on acknowledging these results and focusing on developing models that can genuinely capture the biological complexity of gene regulatory systems, rather than merely increasing model parameter counts. The available evidence suggests that pretraining on large-scale perturbation data may be more beneficial than pretraining on single-cell atlas data alone, pointing to a potential pathway for future improvement [2].

Predicting how cells respond to genetic perturbations represents a significant unsolved challenge in functional genomics with profound implications for therapeutic development. Single-cell foundation models (scFMs) pre-trained on vast single-cell atlases enable in silico perturbation (ISP) predictions, simulating cellular state changes without exhaustive experimental validation. However, the true predictive power of these models remains poorly characterized, particularly their ability to generalize beyond systematic variations caused by selection biases or biological confounders. This guide objectively compares the performance of established perturbation response prediction methods through two distinct biological case studies: T-cell activation and RUNX1-Familial Platelet Disorder (RUNX1-FPD). We provide experimental protocols, quantitative performance data, and analytical frameworks to help researchers select and implement appropriate evaluation strategies for perturbation modeling in their own work.

Computational Framework & Evaluation Methodology

Benchmarking scFM Performance

We evaluated established perturbation response prediction methods on their ability to predict transcriptional outcomes of unseen genetic perturbations. The benchmark included three state-of-the-art methods—compositional perturbation autoencoder (CPA), GEARS, and scGPT—alongside two nonparametric baselines capturing average perturbation effects (Perturbed Mean and Matching Mean). Evaluation spanned ten single-cell perturbation datasets from six sources, covering three distinct technologies and five different cell lines, including genome-wide and combinatorial two-gene perturbation screens [15].

Performance was assessed using metrics previously established in literature:

  • PearsonΔ: Pearson correlation between ground truth and predicted expression changes across all genes
  • PearsonΔ20: Correlation focusing on top 20 differentially expressed genes
  • Root mean-squared error (RMSE) between predicted and actual expression changes
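These metrics can be sketched in a few lines of numpy. This illustration assumes PearsonΔ20 selects the 20 genes with the largest observed absolute change; implementations vary in how the top differentially expressed genes are chosen:

```python
import numpy as np

def pearson_delta(pred_expr, obs_expr, control_expr, top_k=None):
    """PearsonΔ: correlation between predicted and observed expression
    *changes* relative to control. With top_k set, the correlation is
    computed only over the top_k genes with the largest observed
    absolute change (PearsonΔ20 uses top_k=20)."""
    pred_d = pred_expr - control_expr
    obs_d = obs_expr - control_expr
    if top_k is not None:
        top = np.argsort(np.abs(obs_d))[::-1][:top_k]
        pred_d, obs_d = pred_d[top], obs_d[top]
    return float(np.corrcoef(pred_d, obs_d)[0, 1])

def rmse(pred_delta, obs_delta):
    """Root mean-squared error between predicted and actual changes."""
    return float(np.sqrt(np.mean((pred_delta - obs_delta) ** 2)))

control = np.zeros(50)
obs = np.linspace(-1.0, 1.0, 50)
print(pearson_delta(obs, obs, control))           # perfect prediction
print(pearson_delta(obs, obs, control, top_k=20)) # PearsonΔ20 variant
print(rmse(obs - control, obs - control))         # → 0.0
```

Because PearsonΔ is invariant to global scaling of the predicted changes, it is typically reported alongside RMSE, which is not.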

The Systema framework was developed to address limitations in standard evaluation approaches. This framework (1) mitigates systematic biases by focusing on perturbation-specific effects and (2) provides interpretable readouts of a method's ability to reconstruct the perturbation landscape, differentiating predictions that merely replicate systematic effects from those capturing biologically informative responses [15].

The Closed-Loop Innovation

A critical advancement in scFM training involves "closing the loop" by incorporating experimental perturbation data during model fine-tuning. This approach extends scFMs beyond initial pre-training by iteratively refining predictions using observed perturbation outcomes, creating a feedback cycle that significantly enhances biological accuracy [26].

The following diagram illustrates this integrated computational and experimental workflow:

[Figure 1: Closed-Loop scFM Framework — a pre-trained scFM undergoes task-specific fine-tuning and produces open-loop ISP predictions; experimental validation of those predictions generates perturbation data that is integrated back into fine-tuning, yielding closed-loop ISP predictions through an iterative refinement cycle that improves accuracy.]

Case Study 1: T-Cell Activation

Experimental Setup & Model Configuration

Biological Context: T-cell activation through CD3-CD28 stimulation or PMA/ionomycin treatment represents a well-characterized biological system with applications in cancer immunotherapy, autoimmunity, and infectious disease. This case study provides a robust benchmark for evaluating perturbation prediction accuracy [26].

Computational Methods:

  • Base Model: Geneformer-30M-12L (with comparative analysis using Geneformer-106M-12L)
  • Fine-tuning Data: Single-cell RNA sequencing data from four studies of T-cells stimulated via CD3-CD28 beads or PMA/ionomycin
  • Perturbation Validation: Existing CRISPRi and CRISPRa screens of >18,000 genes measuring IL-2 and IFN-γ production after CD3-CD28 stimulation provided orthogonal flow cytometry data for validation [26]

ISP Implementation: The fine-tuned model performed ISP across 13,161 genes, simulating both gene overexpression (CRISPRa) and knockout (CRISPRi) to model transcriptional outcomes.

Performance Comparison & Quantitative Results

Table 1: T-Cell Activation Prediction Performance Metrics

Method | Positive Predictive Value | Negative Predictive Value | Sensitivity | Specificity | AUROC
Open-Loop ISP | 3% | 98% | 48% | 60% | 0.63
Differential Expression | 3% | 78% | 40% | 50% | N/A
Closed-Loop ISP | 9% | 99% | 76% | 81% | 0.86
ISP + DE Overlap | 7% | N/A | N/A | N/A | N/A

The benchmarking revealed that open-loop ISP and differential expression analysis identified largely non-overlapping gene sets, with only 2.9% of predictions overlapping between methods. Notably, only 21 genes were predicted by both methods to have effects in the same direction: 11 shifting toward activation and 10 toward resting state. These overlapping genes represented key T-cell activation regulators including IL2RA, VAV1, ZAP70, CD3D, CD3G, and LCP2 [26].
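The summary statistics in Table 1 derive from standard confusion-matrix counts against the CRISPR screen ground truth. A minimal sketch with illustrative toy counts (not the study's actual numbers):

```python
def screen_metrics(tp, fp, tn, fn):
    """Confusion-matrix summaries of screen validation: positive and
    negative predictive value, sensitivity, and specificity."""
    return {
        "PPV": tp / (tp + fp),                # precision of hit calls
        "NPV": tn / (tn + fn),                # precision of non-hit calls
        "sensitivity": tp / (tp + fn),        # true-positive rate
        "specificity": tn / (tn + fp),        # true-negative rate
    }

# Toy counts chosen so PPV = 9%, as in the closed-loop row above.
m = screen_metrics(tp=9, fp=91, tn=880, fn=20)
print(round(m["PPV"], 2))  # → 0.09
```

Because true hits are rare in genome-scale screens, PPV stays low even for a well-calibrated model, which is why a three-fold PPV gain is a meaningful improvement.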

Minimum Data Requirements for Improvement

A key finding was that closed-loop performance improved dramatically with just 10 perturbation examples (sensitivity: 61%, specificity: 66%) and approached saturation at approximately 20 examples (sensitivity: 76%, specificity: 79%). Performance did not improve significantly with additional examples beyond this point, indicating that even modest experimental validation can substantially enhance closed-loop ISP accuracy compared to baseline ISP [26].

Case Study 2: RUNX1-Familial Platelet Disorder

Disease Context & Experimental Models

Clinical Background: RUNX1-FPD is a rare autosomal dominant disorder caused by germline mutations in the RUNX1 gene, characterized by thrombocytopenia, platelet dysfunction, and approximately 44% lifetime risk of hematological malignancies, primarily myelodysplastic syndrome and acute myeloid leukemia [48] [49]. With an estimated 18,000-20,000 affected individuals in the United States, this condition represents a significant unmet medical need as no interventions currently exist to prevent leukemic progression [50] [26].

Experimental Models:

  • Primary Patient Cells: Bone marrow and peripheral blood samples from >75 RUNX1-FPD patients and >30 healthy donors
  • Engineered HSCs: Human hematopoietic stem cells engineered with RUNX1 loss-of-function mutations
  • Mouse Models: Germline Runx1 mutations mimicking those found in FPD patients [50] [26] [51]

Pathophysiological Insights: Single-cell RNA sequencing of FPD bone marrow cells (122,021 FPD vs. 48,781 healthy cells) revealed altered hematopoietic differentiation with increased monocyte and T-cell populations, decreased megakaryocyte-erythroid progenitors, and upregulated inflammatory pathways including TNF-α/NF-κB, IFN-γ response, and TGF-β signaling [50].

CD74 Signaling Axis as Therapeutic Target

Mechanistic investigation identified CD74 as a master regulator elevated in preleukemic RUNX1-FPD, driving inflammation through mTOR and JAK/STAT pathway activation. CD74-mediated signaling was exaggerated in RUNX1-FPD hematopoietic stem and progenitor cells compared to healthy controls, leading to increased cytokine production [50].

The following diagram illustrates the key signaling pathways and therapeutic intervention points:

[Figure 2: RUNX1-FPD Signaling & Therapeutic Targeting — the RUNX1 mutation drives CD74 upregulation, which activates the mTOR and JAK/STAT pathways; both increase cytokine production, producing myeloid bias and defective megakaryopoiesis. Intervention points: ISO-1 inhibits CD74, sirolimus inhibits mTOR, and ruxolitinib inhibits JAK1/2.]

In Silico Target Prioritization & Validation

Computational Target Discovery: Application of the closed-loop framework to RUNX1-FPD identified eight genes with available small molecule inhibitors that could shift RUNX1-knockout HSCs toward a control-like state. From these, four key therapeutic pathways emerged:

  • mTOR signaling
  • CD74-MIF signaling axis
  • Protein kinase C
  • Phosphoinositide 3-kinase [26]

Experimental Validation: Genetic and pharmacological targeting of CD74 with ISO-1, and its downstream targets JAK1/2 and mTOR with ruxolitinib and sirolimus respectively, reversed RUNX1-FPD differentiation defects in vitro and in vivo and reduced inflammation. These interventions suppressed the exaggerated CD74 signaling, normalized mTOR and JAK/STAT pathway activation, and reduced cytokine production [50].

Table 2: RUNX1-FPD Therapeutic Targets & Experimental Outcomes

Therapeutic Target | Experimental Agent | Experimental Model | Key Outcomes
CD74 Signaling | ISO-1 | Primary patient BM cells, in vivo models | Reduced inflammation, reversed differentiation defects
mTOR Pathway | Sirolimus | Primary patient BM cells, in vivo models | Restored megakaryocytic differentiation, reduced cytokine production
JAK/STAT Pathway | Ruxolitinib | Primary patient BM cells, in vivo models | Suppressed inflammatory signaling, improved hematopoietic function
RUNX1 Stabilization | Proteasomal inhibition | Patient-derived iPSCs, AML blood cells | Enhanced RUNX1 levels, improved megakaryocytic differentiation

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Research Materials for Perturbation & FPD Studies

Reagent/Category | Specific Examples | Research Application | Functional Role
scFM Platforms | Geneformer-30M-12L, scGPT, GEARS, CPA | In silico perturbation prediction | Base models for predicting transcriptional responses to genetic perturbations
Genetic Perturbation Tools | CRISPRi/a, Perturb-seq | T-cell activation screens | High-throughput functional genomic screening to validate predictions
RUNX1-FPD Models | Primary patient HSPCs, RUNX1-engineered HSCs, Patient-derived iPSCs | Disease modeling, drug screening | Physiologically relevant systems for studying disease mechanisms and therapies
Therapeutic Compounds | ISO-1, Ruxolitinib, Sirolimus | Target validation, functional rescue | Pharmacological probes for pathway inhibition and therapeutic assessment
Analytical Tools | Systema framework, AUCell, GSEA | Method evaluation, pathway analysis | Benchmarking prediction accuracy and identifying biologically meaningful effects

Discussion & Comparative Analysis

Performance Interpretation in Context

The case studies reveal that standard evaluation metrics can be misleading due to systematic variation—consistent transcriptional differences between perturbed and control cells arising from selection biases or biological confounders. In the Adamson (endoplasmic reticulum homeostasis) and Norman (cell cycle and growth) datasets, systematic differences in pathway activities between perturbed and control cells significantly influenced predictive performance [15].

The closed-loop framework demonstrated substantial improvement across both case studies, with the most significant gains in positive predictive value. This approach effectively addresses the limitation of open-loop predictions merely capturing average perturbation effects rather than perturbation-specific biology.

Biological vs. Technical Generalization

A critical distinction emerges between technical generalization (performance on unseen perturbations within similar biological contexts) and biological generalization (performance across different cell types and disease states). While methods showed reasonable technical generalization in T-cell activation, biological generalization across the hematopoiesis-to-inflammation spectrum of RUNX1-FPD presented greater challenges, highlighting the need for domain-specific fine-tuning and biological context incorporation.

Clinical Translation Potential

For RUNX1-FPD, the identification of the CD74 signaling axis and successful pharmacological targeting with repurposed JAK1/2 and mTOR inhibitors provides a promising near-term therapeutic strategy. The computational prediction of protein kinase C and phosphoinositide 3-kinase as additional targets offers expanded opportunities for intervention [50] [26].

Based on our comparative analysis, we recommend:

  • Adopt Systema Framework: Implement the Systema evaluation framework or similar approaches that control for systematic variation when benchmarking perturbation prediction methods.

  • Prioritize Closed-Loop Implementation: Incorporate experimental perturbation data during model fine-tuning, as even 10-20 validated examples can significantly enhance prediction accuracy.

  • Contextualize Performance Metrics: Interpret predictive performance in the context of systematic variation specific to each biological system and experimental design.

  • Leverage Cross-Method Consensus: Consider genes identified by both ISP and differential expression analysis as high-confidence targets, as they demonstrate substantially higher positive predictive value.

  • Validate in Disease-Relevant Models: Employ physiological systems such as primary patient cells and genetically engineered HSCs for target validation, particularly for rare diseases where samples are scarce.
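The cross-method consensus recommendation above can be illustrated with a toy example. The gene sets below are illustrative, not real screening output (CD74, MIF, and the PKC/PI3K genes echo targets discussed in this article; the `GENE_*` entries are placeholders), but the arithmetic shows why intersecting in silico perturbation (ISP) hits with differential expression (DE) hits raises positive predictive value:

```python
# Illustrative hit sets from two independent methods; the validated set
# stands in for genes confirmed in follow-up experiments.
isp_hits = {"CD74", "MIF", "PRKCB", "PIK3CA", "GENE_A", "GENE_B"}
de_hits = {"CD74", "MIF", "PIK3CA", "GENE_C", "GENE_D"}
validated = {"CD74", "MIF", "PIK3CA"}

def ppv(hits, truth):
    """Positive predictive value: fraction of called hits that are true."""
    return len(hits & truth) / len(hits) if hits else 0.0

consensus = isp_hits & de_hits

print(f"ISP PPV:       {ppv(isp_hits, validated):.2f}")
print(f"DE PPV:        {ppv(de_hits, validated):.2f}")
print(f"Consensus PPV: {ppv(consensus, validated):.2f}")
```

Because each method's false positives tend to differ, the intersection filters them out while retaining hits supported by both lines of evidence, which is what makes consensus genes high-confidence candidates.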

The integration of sophisticated computational prediction with rigorous experimental validation through closed-loop frameworks represents the most promising path toward realizing the potential of "virtual cell" models for biomedical discovery and therapeutic development.

Conclusion

The current state of perturbation effect prediction is one of recalibration. While scFMs represent a significant technological ambition, consistent benchmarking reveals they have not yet surpassed the predictive power of deliberately simple models for core tasks. The critical takeaways are threefold: first, systematic biological and technical variation in datasets poses a major challenge that inflates standard performance metrics; second, new evaluation frameworks like Systema are essential to distinguish true biological insight from data artifacts; and third, the emerging 'closed-loop' approach, which iteratively integrates experimental perturbation data into model fine-tuning, demonstrates a tangible path to substantially improved accuracy. The future of the field hinges on developing more robust models that can genuinely generalize to novel biology, coupled with transparent and rigorous benchmarking. For biomedical research, this evolving capability holds the long-term promise of accelerating therapeutic discovery, particularly for rare diseases where experimental screening is most challenging.

References