The Promise and Pitfalls of Zero-Shot Single-Cell Foundation Models in Perturbation Prediction

Thomas Carter, Nov 27, 2025


Abstract

Single-cell foundation models (scFMs), pretrained on millions of cells, promise to revolutionize the in-silico prediction of cellular responses to genetic and drug perturbations. However, rigorous benchmarking reveals significant limitations in their zero-shot capabilities, where models are used without task-specific fine-tuning. This article synthesizes recent evidence showing that zero-shot scFMs often fail to outperform deliberately simple baselines, struggle with distribution shifts, and offer limited improvements for predicting unseen perturbations. We explore the foundational causes of these shortcomings, survey emerging methodological fixes like efficient fine-tuning, provide a framework for troubleshooting model performance, and outline rigorous validation standards. For researchers and drug development professionals, this critical appraisal provides essential guidance for navigating the current landscape of scFMs, enabling more informed and effective application in perturbation biology and therapeutic discovery.

The Reality Check: Exposing the Fundamental Limits of Zero-Shot scFMs

Troubleshooting Guides & FAQs

Q1: Why does my zero-shot single-cell foundation model (scFM) underperform on basic cell type clustering compared to established methods?

A: Current benchmarking reveals that in zero-shot settings, scFMs like Geneformer and scGPT can be outperformed in cell type clustering by simpler methods, including the selection of Highly Variable Genes (HVG) or using established tools like Harmony and scVI [1]. This is measured by metrics such as the average BIO (AvgBio) score and average silhouette width (ASW) [1]. The underlying issue may be that the masked language model pretraining framework does not inherently produce cell embeddings that are optimal for this specific biological task without further, task-specific fine-tuning [1].

Q2: When predicting genetic perturbation effects, why do complex scFMs fail to beat simple baseline models?

A: Multiple independent studies have found that for predicting transcriptome changes after single or double genetic perturbations, several scFMs (including scGPT and scFoundation) and other deep learning models do not consistently outperform deliberately simple baselines [2] [3]. These baselines include:

  • The 'additive' model: for a double perturbation, it predicts the sum of the individual logarithmic fold changes observed in the single perturbations [2].
  • The 'perturbed mean' baseline: it always predicts the average expression profile across all perturbed cells [3].

The performance gap suggests that current scFMs may be primarily capturing broad, systematic differences between control and perturbed cells rather than mastering the underlying perturbation biology [3].
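Assuming log-scale mean expression profiles per condition, both baselines reduce to a few lines of NumPy. The data below is synthetic and purely illustrative; none of the variable names come from the cited benchmarks:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 50

# Synthetic log-normalized mean expression profiles (illustrative only).
control = rng.normal(5.0, 1.0, n_genes)            # mean profile of control cells
pert_a = control + rng.normal(0.5, 0.2, n_genes)   # single perturbation A
pert_b = control + rng.normal(-0.3, 0.2, n_genes)  # single perturbation B

# 'Additive' baseline for the double perturbation A+B:
# add both single-perturbation log-fold changes to the control profile.
lfc_a = pert_a - control
lfc_b = pert_b - control
additive_pred = control + lfc_a + lfc_b

# 'Perturbed mean' baseline: always predict the average expression
# profile across all perturbed conditions seen in training.
perturbed_mean_pred = np.mean([pert_a, pert_b], axis=0)
```

The point of keeping these baselines this simple is deliberate: if a deep model cannot beat them, it has not learned perturbation-specific biology.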

Q3: My scFM performs well on batch integration for some datasets but fails on others. What is happening?

A: The performance of scFMs on batch integration is inconsistent. While they may successfully integrate data from different experiments using the same technique, they often struggle to correct for batch effects between different experimental techniques [1]. Quantitative evaluations show that methods like Harmony and scVI frequently outperform scFMs on this task, and in many cases, even simply selecting HVGs can achieve superior batch integration scores [1]. The effectiveness of an scFM can be highly dependent on the specific characteristics of the dataset and the nature of the batch effects.

Q4: Is there a single scFM that consistently outperforms all others across diverse tasks?

A: No. Comprehensive benchmarks indicate that no single scFM consistently outperforms all others across every task [4]. The best model for a given project depends on factors such as dataset size, task complexity, the need for biological interpretability, and available computational resources [4]. Model selection should therefore be a tailored decision based on the specific experimental context and goals.

Quantitative Performance Data

The tables below summarize key findings from recent benchmark studies, providing a direct comparison between scFMs and simpler baseline methods.

Table 1: Performance Comparison on Cell-Level Tasks (Zero-Shot)

| Task | Top-Performing Methods | Underperforming Methods | Key Metric(s) | Notes |
| --- | --- | --- | --- | --- |
| Cell Type Clustering | HVG, scVI, Harmony [1] | scGPT, Geneformer [1] | AvgBIO, ASW [1] | scFMs show inconsistent performance across different datasets [1]. |
| Batch Integration | HVG, scVI, Harmony [1] | Geneformer, scGPT [1] | Batch mixing scores, PCR [1] | Geneformer often increases batch effect variance compared to input data [1]. |

Table 2: Performance Comparison on Perturbation Prediction Tasks

| Task | Simple Baseline Models | Complex Models Benchmarked | Key Finding | Reference |
| --- | --- | --- | --- | --- |
| Double Perturbation Prediction | Additive Model, 'No Change' Model | GEARS, CPA, scGPT, scFoundation, scBERT, Geneformer, UCE* | "All models had a prediction error substantially higher than the additive baseline." | [2] |
| Unseen Single Perturbation Prediction | Perturbed Mean, Linear Model | CPA, GEARS, scGPT | "Simple baselines performed comparatively or outperformed state-of-the-art methods." | [3] |
| Unseen Combinatorial Perturbation Prediction | Matching Mean Baseline | GEARS, scGPT | The matching mean baseline "outperformed all other baselines by a considerable margin." | [3] |

Experimental Protocols

Protocol for Benchmarking Zero-Shot Cell Embeddings

This protocol is adapted from evaluations of scFM zero-shot capabilities [1].

  • Embedding Extraction: For the target dataset, obtain cell embeddings from the scFM (e.g., scGPT, Geneformer) without performing any further fine-tuning on the dataset.
  • Baseline Generation: Generate comparable representations using baseline methods:
    • HVG: Select Highly Variable Genes from the dataset.
    • scVI: Process the dataset using the scVI model.
    • Harmony: Process the dataset using the Harmony integration tool.
  • Dimensionality Reduction: Apply a standard dimensionality reduction technique (e.g., UMAP) to all embeddings and baseline representations for visualization.
  • Quantitative Evaluation: Calculate established clustering and batch correction metrics.
    • Cell Type Clustering: Use Average BIO (AvgBio) score and Average Silhouette Width (ASW) to assess how well the embeddings separate known cell types.
    • Batch Integration: Use batch mixing scores and Principal Component Regression (PCR) to assess the removal of technical batch effects while preserving biological variation.
  • Visual Inspection: Qualitatively assess the 2D visualizations to check if the primary structure in the embeddings is driven by biology (cell type) or technical artifacts (batch).
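As a minimal sketch of the quantitative evaluation step, the average silhouette width over cell-type labels can be computed with scikit-learn's `silhouette_score` (a stand-in here for the full scib metric suite; the toy embedding below is synthetic):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Synthetic 2-D "embeddings" for three cell types, standing in for
# representations from an scFM, HVG selection, scVI, or Harmony.
labels = np.repeat(["T cell", "B cell", "NK"], 100)
centers = {"T cell": (0.0, 0.0), "B cell": (5.0, 0.0), "NK": (0.0, 5.0)}
emb = np.vstack([rng.normal(centers[l], 0.5, size=2) for l in labels])

# Average silhouette width over cell-type labels:
# higher values mean better cell-type separation in the embedding.
asw = silhouette_score(emb, labels)
print(f"cell-type ASW: {asw:.3f}")
```

Running the same scoring function on each method's embedding of the same cells gives the head-to-head comparison the protocol calls for.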

Protocol for Benchmarking Perturbation Effect Prediction

This protocol is based on benchmarks comparing scFMs to simple baselines [2] [3].

  • Data Partitioning: Split the perturbation dataset such that specific perturbations (e.g., a set of single-gene or double-gene perturbations) are held out from the training data entirely. This tests generalization to unseen perturbations.
  • Model Training & Fine-tuning: Fine-tune the scFMs (e.g., scGPT, scFoundation, GEARS) on the training set of perturbations according to their specified procedures.
  • Baseline Calculation:
    • Perturbed Mean: Calculate the average expression profile across all perturbed cells in the training set.
    • Additive Model (for double perturbations): For a double perturbation A+B, predict the expression by summing the log-fold changes of the individual perturbations A and B from the control.
    • Matching Mean (for double perturbations): For a double perturbation A+B, predict the expression by averaging the centroid expression profiles of the single perturbations A and B from the training data.
  • Prediction & Evaluation: On the held-out test perturbations, compare the model and baseline predictions against the ground-truth expression data.
  • Metrics: Calculate standard metrics, including:
    • L2 Distance / RMSE: The root mean-squared error between predicted and observed expression values.
    • PearsonΔ: The Pearson correlation between predicted and observed expression changes (deltas) with respect to control for all genes.
    • PearsonΔ20: The same as PearsonΔ, but calculated only on the top 20 differentially expressed genes.
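The three metrics above fit in one small function (the function name and the synthetic sanity-check data are mine, not from the benchmarks):

```python
import numpy as np

def perturbation_metrics(pred, obs, control, top_k=20):
    """RMSE, PearsonΔ, and PearsonΔ on the top-k DE genes."""
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    delta_pred = pred - control  # predicted change vs. control
    delta_obs = obs - control    # observed change vs. control
    pearson_delta = np.corrcoef(delta_pred, delta_obs)[0, 1]
    # Restrict to the top-k differentially expressed genes by observed |Δ|.
    top = np.argsort(-np.abs(delta_obs))[:top_k]
    pearson_delta_k = np.corrcoef(delta_pred[top], delta_obs[top])[0, 1]
    return rmse, pearson_delta, pearson_delta_k

# Synthetic sanity check: a prediction that tracks the observed response.
rng = np.random.default_rng(2)
control = rng.normal(5.0, 1.0, 200)
obs = control + rng.normal(0.0, 1.0, 200)
pred = obs + rng.normal(0.0, 0.3, 200)
rmse, p_all, p_top = perturbation_metrics(pred, obs, control)
print(f"RMSE={rmse:.2f}  PearsonΔ={p_all:.2f}  PearsonΔ20={p_top:.2f}")
```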

Visualization of Evidence for the Performance Gap

Key Evidence for the scFM Performance Gap

(Diagram: Key evidence for the scFM performance gap. Two branches: (1) zero-shot cell tasks such as clustering and batch integration, where HVG, scVI, and Harmony outperform scGPT and Geneformer, and Geneformer embeddings often show strong batch effects; (2) perturbation prediction for unseen single/double perturbations, where the 'additive' and 'perturbed mean' baselines outperform scFMs such as scGPT, and models capture systematic variation but not specific perturbation biology.)

Experimental Workflow for Perturbation Benchmarking

(Diagram: Experimental workflow. Partition the perturbation dataset, holding out specific perturbations; use the training set both to fine-tune scFMs, e.g., scGPT and GEARS, and to calculate simple baselines, i.e., the perturbed mean and additive models; compare all predictions against ground truth on the unseen test set; evaluate with L2 distance, PearsonΔ, and PearsonΔ20.)

The Scientist's Toolkit: Essential Computational Resources

Table 3: Essential Computational Tools for scFM Benchmarking

| Item Name | Function / Application | Key Insight from Benchmarking |
| --- | --- | --- |
| Highly Variable Genes (HVG) | A baseline method for feature selection prior to clustering or integration. | Surprisingly robust; often outperforms or matches scFMs in zero-shot cell type clustering and batch integration tasks [1]. |
| Harmony | Algorithm for integrating single-cell data across multiple batches or experiments. | A strong, established baseline for batch correction that frequently outperforms zero-shot scFM embeddings [1]. |
| scVI | A probabilistic generative model for scRNA-seq data analysis, including integration. | Consistently performs well on cell-level tasks and serves as a powerful benchmark against which to compare new scFMs [4] [1]. |
| 'Perturbed Mean' Baseline | A simple model that predicts the average expression profile of all perturbed cells. | Crucial for perturbation benchmarks; reveals that complex models may not capture much beyond this average effect for unseen perturbations [2] [3]. |
| 'Additive' Baseline | A model that predicts double perturbation effects as the sum of single perturbation effects. | Essential for evaluating combinatorial perturbation prediction; often outperforms specialized deep learning models [2]. |
| Systema Framework | An evaluation framework designed to control for systematic variation in perturbation data. | Helps distinguish models that capture true perturbation-specific biology from those that merely learn dataset-wide biases [3]. |

Frequently Asked Questions (FAQs)

FAQ 1: What is the "Distribution Shift Problem" in the context of single-cell perturbation prediction?

The distribution shift problem refers to the significant performance deterioration that single-cell foundation models (scFMs) exhibit when they encounter strong or atypical genetic perturbations that differ from the data they were trained on. In a zero-shot setting, these models struggle to generalize to these out-of-distribution examples, often failing to accurately predict the transcriptional outcomes of such perturbations [5] [6].

FAQ 2: Why do current single-cell foundation models (scFMs) fail on atypical perturbations?

Benchmarking studies indicate that current-generation scFMs primarily capture systematic variation—the consistent transcriptional differences between pools of perturbed and control cells caused by selection biases or biological confounders—rather than genuine, perturbation-specific effects. When presented with an atypical perturbation that does not share these common systematic patterns, the models lack the specific biological knowledge to make an accurate prediction [7].

FAQ 3: Are there any standardized benchmarks to evaluate this issue?

Yes, the PertEval-scFM framework is a standardized benchmark specifically designed to evaluate models, including their performance on distribution shifts. It assesses whether zero-shot scFM embeddings genuinely enhance perturbation effect prediction compared to simpler baseline models [5] [6].

FAQ 4: What is a key pitfall in evaluating my own perturbation prediction model?

A common pitfall is relying on standard reference-based metrics (like Pearson correlation on differential expression) without accounting for systematic variation. A model can achieve a high score by simply learning the average difference between all perturbed and control cells, which does not reflect its ability to predict the unique effect of a specific, unseen perturbation. The Systema framework is a new evaluation method designed to mitigate this bias [7].
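This pitfall is easy to reproduce on synthetic data: when every perturbation shares a common systematic shift plus only a small specific effect, a constant "perturbed mean" predictor already scores a high PearsonΔ (all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n_perts, n_genes = 40, 300

# Every perturbation's expression change = shared systematic shift
# + a small perturbation-specific effect (synthetic data).
systematic = rng.normal(0.0, 1.0, n_genes)
specific = rng.normal(0.0, 0.3, (n_perts, n_genes))
observed = systematic + specific  # one row per perturbation

# "Perturbed mean" predictor: the average change over training
# perturbations, used as the prediction for every perturbation.
pred = observed.mean(axis=0)

# The constant predictor still correlates strongly with each perturbation.
corrs = [np.corrcoef(pred, obs)[0, 1] for obs in observed]
print(f"mean PearsonΔ of the perturbed-mean baseline: {np.mean(corrs):.2f}")
```

A high score here reflects only the shared shift, not any perturbation-specific knowledge, which is exactly the bias Systema is designed to expose.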

FAQ 5: What are some emerging solutions to improve generalization?

Emerging approaches focus on better integration of biological knowledge and representation learning. For instance:

  • SynthPert: Enhances Large Language Models (LLMs) by fine-tuning them on high-quality, synthetic chain-of-thought explanations of perturbation mechanisms, which improves cross-cell-type generalization [8].
  • scREPA: Aligns the internal representations of a prediction model with high-quality external representations from pre-trained scFMs, leading to more robust predictions on unseen conditions and noisy data [9].

Troubleshooting Guides

Problem: My model's performance drops significantly on unseen or strong genetic perturbations.

Diagnosis: This is a classic symptom of the distribution shift problem. The model is likely overfitting to the systematic variation present in its training data and cannot extrapolate to novel scenarios.

Solution - Implement Rigorous Evaluation: Follow the protocol below to diagnose whether your model is capturing true biological signals or just systematic bias.

  • Experimental Protocol: Isolating Perturbation-Specific Effects with Systema

    Objective: To evaluate a model's ability to predict perturbation-specific effects, free from the confounding influence of systematic variation.

    Materials:

    • Your trained perturbation prediction model.
    • Perturbation dataset with held-out test perturbations (e.g., from Adamson, Norman, or Replogle datasets) [7].
    • Baseline models: "Perturbed Mean" and "Matching Mean" [7].

    Methodology:

    • Benchmark Against Simple Baselines: Compare your model's performance on the held-out test set against the simple non-parametric baselines.
      • The Perturbed Mean baseline predicts the average expression profile across all perturbed cells for every perturbation.
      • The Matching Mean baseline (for combinatorial perturbations) predicts the average of the mean profiles of the two constituent single-gene perturbations [7].
    • Quantify Systematic Variation: Use the methods described in Systema [7] to quantify the level of systematic variation in your dataset. This can involve:
      • Gene Set Enrichment Analysis (GSEA): Identify pathways that are consistently enriched when comparing all perturbed cells against all control cells.
      • AUCell: Score the activity of these systematically enriched pathways in single cells to visualize the stark difference between the perturbed and control populations [7].
    • Apply the Systema Framework: Evaluate your model using the Systema framework, which emphasizes the model's ability to reconstruct the specific landscape of individual perturbations rather than just the average treatment effect [7].

    Interpretation of Results:

    • If your complex model performs similarly to or only marginally better than the simple "Perturbed Mean" baseline, it is likely just capturing systematic variation.
    • A model that truly generalizes will significantly outperform these baselines under the Systema evaluation, particularly for perturbations that are functionally distinct from the bulk of the training data.
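The matching-mean baseline used in this comparison is itself only a few lines (the gene names and centroid profiles below are synthetic placeholders):

```python
import numpy as np

def matching_mean(singles, combo):
    """Matching-mean baseline: for a double perturbation (a, b), average
    the training-set centroid profiles of the two single perturbations."""
    a, b = combo
    return (singles[a] + singles[b]) / 2.0

rng = np.random.default_rng(4)
n_genes = 100
# Hypothetical centroid expression profiles of single-gene perturbations.
singles = {g: rng.normal(0.0, 1.0, n_genes) for g in ["KLF1", "BAK1", "CEBPE"]}

pred = matching_mean(singles, ("KLF1", "BAK1"))
```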

The following diagram illustrates this diagnostic experimental workflow.

(Diagram: Diagnostic workflow. Starting from poor performance on atypical perturbations: Step 1, benchmark against simple baselines (perturbed mean); Step 2, quantify systematic variation (GSEA, AUCell); Step 3, evaluate with the Systema framework. If the model does not outperform the baselines under Systema, it mainly captures systematic variation and generalization needs improvement; if it does, it captures true perturbation-specific effects and validation is successful.)

Problem: My model lacks biological reasoning for its predictions, hindering trust and utility.

Diagnosis: The model has learned statistical associations but not the underlying mechanistic biology, making it an unreliable tool for hypothesis generation.

Solution - Incorporate Synthetic Biological Reasoning: Use a knowledge distillation approach to infuse biological reasoning into a smaller, more efficient model, as demonstrated by SynthPert [8].

  • Experimental Protocol: Knowledge Distillation with Synthetic Reasoning Traces

    Objective: To enhance a model's biological reasoning capabilities for perturbation prediction through supervised fine-tuning on synthetic chain-of-thought explanations.

    Materials:

    • A frontier LLM (e.g., OpenAI o4-mini) to generate reasoning traces.
    • A critic model (can be another frontier LLM) to grade explanation quality.
    • A base LLM for fine-tuning (e.g., DeepSeek-R1 8B) [8].
    • The PerturbQA benchmark dataset [8].

    Methodology:

    • Synthetic Data Generation: For each training data point (cell type, perturbation, gene, outcome), use the frontier model to generate a mechanistic explanation for the observed outcome.
    • Quality Control with a Critic Model: Present the generated explanation and the ground truth to a separate critic model. Have the critic grade the explanation on a scale (e.g., 'excellent' to 'terrible'). Retain only the highest-quality explanations (e.g., those graded 'excellent') for training [8].
    • Supervised Fine-Tuning: Fine-tune your base LLM not just on the raw (input, output) tuples, but on the (input, reasoning trace, output) sequences. This teaches the model the "how" and "why" behind the prediction.
    • Evaluation: Rigorously evaluate the fine-tuned model on a held-out test set, paying special attention to its performance on unseen cell types or strong perturbations.

    Interpretation of Results:

    • Success is demonstrated by the fine-tuned model achieving state-of-the-art performance and showing strong cross-cell-type generalization, potentially even surpassing the capabilities of the larger frontier model that generated the training data [8].
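The curation step (synthetic data generation, critic filtering, triplet formatting) can be sketched as follows; the record fields, grade labels, and `<think>` formatting are illustrative assumptions, not the actual SynthPert formats:

```python
# Hypothetical records: (input, reasoning trace, outcome) plus a critic grade.
records = [
    {"input": "K562 | knockout: GATA1 | readout gene: HBG1",
     "reasoning": "GATA1 is a master erythroid regulator, so ...",
     "output": "downregulated", "critic_grade": "excellent"},
    {"input": "K562 | knockout: AAVS1 | readout gene: HBG1",
     "reasoning": "AAVS1 is a safe-harbor locus, so ...",
     "output": "no change", "critic_grade": "terrible"},
]

def build_sft_examples(records, keep_grade="excellent"):
    """Keep only critic-approved traces and format (input, reasoning, output)
    sequences for supervised fine-tuning."""
    kept = [r for r in records if r["critic_grade"] == keep_grade]
    return [
        {"prompt": r["input"],
         "completion": f"<think>{r['reasoning']}</think>\n{r['output']}"}
        for r in kept
    ]

examples = build_sft_examples(records)
print(len(examples), "training example(s) retained")
```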

The workflow for this solution is shown below.

(Diagram: SynthPert workflow. Generate synthetic reasoning traces with a frontier LLM; filter them with a critic model, keeping only 'excellent' explanations; run supervised fine-tuning on the high-quality (input, reasoning, output) triplets; evaluate on held-out and cross-cell-type data. Outcome: a model with enhanced biological reasoning and generalization.)

Performance Benchmarking Tables

Table 1: Benchmarking scFMs against Simple Baselines for Zero-Shot Perturbation Prediction [5] [6] [7]

| Model / Baseline | Core Principle | Performance on Unseen Perturbations | Performance under Distribution Shift | Key Limitation |
| --- | --- | --- | --- | --- |
| Zero-shot scFMs | Contextualized embeddings from models pre-trained on large scRNA-seq atlases. | Limited improvement over baselines [5] [6]. | Significant performance deterioration, especially on strong/atypical perturbations [5] [6]. | Captures systematic variation rather than perturbation-specific effects [7]. |
| Perturbed Mean | Non-parametric baseline; predicts the average expression of all perturbed cells. | Surprisingly competitive or superior for unseen one-gene perturbations [7]. | Robust, as it represents the average systematic effect. | Cannot predict any perturbation-specific details; only the average treatment effect. |
| Matching Mean | Non-parametric baseline; for combo perturbation X+Y, averages the mean profiles of X and Y. | Outperforms complex models for unseen two-gene perturbations [7]. | Robust for combinations of seen single-gene perturbations. | Relies on having seen the constituent single-gene perturbations. |

Table 2: Emerging Methods for Improved Generalization [8] [9]

| Model | Core Methodology | Reported Advantage | Applicability |
| --- | --- | --- | --- |
| SynthPert | Supervised fine-tuning of LLMs on synthetic, quality-filtered chain-of-thought explanations. | Achieves 87% accuracy on unseen RPE1 cells; state-of-the-art on PerturbQA [8]. | Enhances biological reasoning and cross-cell-type generalization. |
| scREPA | Aligns VAE latent embeddings with biologically meaningful representations from pre-trained scFMs using cycle-consistent alignment. | Outperforms existing methods in predicting DEGs and whole-transcriptome responses; generalizes well to unseen conditions and noisy data [9]. | Improves representation quality for robust prediction under data limitations. |

Table 3: Essential Resources for scFM Perturbation Prediction Research

| Resource Name | Type | Function in Research | Example from Search Results |
| --- | --- | --- | --- |
| PertEval-scFM | Software Benchmark | Standardized framework to evaluate and compare the performance of single-cell foundation models for perturbation prediction in a zero-shot setting [5] [6]. | https://github.com/aaronwtr/PertEval [5] |
| Systema | Software Framework / Evaluation Metric | An evaluation framework that mitigates the confounding effects of systematic variation, providing a clearer readout of a model's ability to capture perturbation-specific biology [7]. | https://github.com/mlbio-epfl/systema [7] |
| PerturbQA | Dataset & Benchmark | A benchmark that reformulates perturbation experiments into natural language tuples, enabling the evaluation of LLM-based biological reasoning [8]. | Used as the primary evaluation dataset in SynthPert [8]. |
| Adamson & Norman Datasets | Experimental Data | Key single-cell perturbation datasets often used for training and benchmarking. They target specific biological processes but are known to contain significant systematic variation [7]. | Used in the Systema benchmark to demonstrate systematic variation [7]. |
| Replogle (RPE1) Dataset | Experimental Data | A large-scale, genome-wide perturbation screen used to study model generalization and artifacts like cell-cycle arrest induced by perturbations [7]. | Used in the Systema benchmark to demonstrate cell-cycle systematic bias [7]. |

Frequently Asked Questions

Q1: What does "zero-shot" performance mean for a single-cell foundation model (scFM), and why is it important? Zero-shot evaluation tests a foundation model's capabilities without any further task-specific training or fine-tuning: the model's pre-trained internal representation, or "embedding," is used directly for downstream analysis. This is critical for exploratory research where predefined labels don't exist or where fine-tuning is not feasible, such as discovery settings in which the biological outcomes are unknown [1].

Q2: Our team is getting poor results using scGPT and Geneformer for zero-shot perturbation prediction. Are we doing something wrong? Not necessarily. Benchmarking studies have consistently found that these models in a zero-shot setting do not outperform, and are sometimes outperformed by, deliberately simple baseline methods. This appears to be a fundamental limitation of current model architectures and pretraining, not user error [6] [2].

Q3: What are the main types of failures when predicting genetic interactions? Models struggle with several specific scenarios:

  • Strong or Atypical Effects: Predicting the impact of strong or highly unusual perturbations [6].
  • Synergistic Interactions: Most models fail to correctly identify or predict synergistic genetic interactions, where the combined effect is greater than the sum of individual effects [2].
  • Distribution Shifts: Performance degrades when models encounter data that differs significantly from their pretraining sets [6].

Q4: Is there a way to improve the accuracy of these models for our perturbation experiments? Yes, recent research suggests moving from an "open-loop" to a "closed-loop" framework can significantly improve performance. This involves iteratively fine-tuning the foundation model with experimental perturbation data (e.g., from Perturb-seq). This approach has been shown to triple the Positive Predictive Value (PPV) of predictions [10].

Troubleshooting Guides

Issue 1: Poor Zero-Shot Performance on Cell Type Clustering and Batch Integration

Problem: When using scGPT or Geneformer embeddings for tasks like cell type identification or removing batch effects without fine-tuning, the performance is inconsistent and worse than established methods.

Investigation & Diagnosis:

  • Compare against simple baselines. Always include a method like Highly Variable Genes (HVG) selection in your benchmark. Research shows HVG can outperform foundation model embeddings in both cell type clustering and batch correction tasks [1].
  • Check for dataset overlap. Verify if your evaluation dataset was part of the model's pretraining corpus. Surprisingly, models do not consistently perform better on datasets they were trained on, but this knowledge is crucial for interpretation [1].
  • Quantify performance with robust metrics. Use metrics like Average BIO score for clustering and a combination of batch mixing and biological conservation scores (e.g., principal component regression score) for integration [1].

Solution: For zero-shot tasks, rely on proven, simpler methods as your primary baseline.

  • For Cell Type Clustering: Use Harmony or scVI.
  • For Batch Integration: Use Harmony, scVI, or start with HVG selection.
  • Re-evaluate Model Choice: If your workflow depends entirely on zero-shot performance, consider that current foundation models may not be the optimal tool for these specific tasks [1].

Issue 2: Inaccurate Prediction of Genetic Perturbation Effects

Problem: The model's predictions for gene expression changes after single or double genetic perturbations do not match experimental validation data.

Investigation & Diagnosis:

  • Benchmark against additive and "no change" models. Compare your model's predictions to a simple baseline that sums the individual logarithmic fold changes (the additive model) or one that predicts no change from the control condition. Current foundation models and other deep learning models often fail to outperform these simple baselines [2].
  • Analyze specific interaction types. Check if your model is systematically failing at certain types of genetic interactions, such as synergistic effects. Many models are biased towards predicting "buffering" interactions and rarely correctly predict synergistic ones [2].
  • Inspect prediction variability. Ensure the model's predictions vary sufficiently across different perturbations. Some models show surprisingly little variation in their outputs for different perturbations, acting similarly to the "no change" baseline [2].

Solution:

  • Incorporate a linear baseline. Implement a simple linear model that uses dimension-reducing embeddings of your training data. This can serve as a strong, hard-to-beat baseline [2].
  • Implement a "Closed-Loop" Framework. Instead of relying on a single round of prediction, use a small set of experimental perturbation data to fine-tune the foundation model.
    • Even a small number of perturbation examples (e.g., 10-20) can dramatically improve prediction accuracy [10].
    • This iterative process of prediction, experimental validation, and model refinement significantly enhances Positive Predictive Value (PPV), sensitivity, and specificity [10].
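A minimal version of the linear-baseline idea: regress expression responses on a low-dimensional embedding of the training data (here PCA plus Ridge regression). Everything below is synthetic and only illustrates the shape of such a baseline:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
n_perts, n_features, n_genes, k = 60, 500, 500, 16

# Synthetic training data: one feature vector and one mean expression
# response per training perturbation.
X = rng.normal(0.0, 1.0, (n_perts, n_features))
Y = X @ rng.normal(0.0, 0.1, (n_features, n_genes))  # toy linear responses

# Linear baseline: Ridge regression on a k-dimensional PCA embedding.
Z = PCA(n_components=k, random_state=0).fit_transform(X)
baseline = Ridge(alpha=1.0).fit(Z, Y)
print("train R^2:", round(baseline.score(Z, Y), 3))
```

Any deep model worth its training budget should clear this bar on held-out perturbations before its predictions are trusted.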

(Diagram: Closed-loop cycle. A pre-trained foundation model produces open-loop ISP predictions; these are validated experimentally, e.g., by Perturb-seq; the results are incorporated into fine-tuning, yielding a closed-loop model with refined predictions; the cycle then iterates.)

Issue 3: Model Fails to Predict Effects of Unseen Perturbations

Problem: The model cannot accurately extrapolate to predict the effects of perturbing a gene that was not present in its fine-tuning dataset.

Investigation & Diagnosis: This is a known weakness. Claims that foundation models can inherently generalize to unseen perturbations through pretraining are not yet fully supported by benchmarks [2].

Solution:

  • Use a linear model with pretrained embeddings. Extract the gene embedding matrix (G) from scFoundation or scGPT, and a perturbation embedding matrix (P) from a model like GEARS. Use these in a linear model framework, which can perform as well as or better than the models' own complex decoders [2].
  • Leverage perturbation data for pretraining. If possible, pretrain the linear model's perturbation embedding (P) on existing large-scale perturbation datasets. This has been shown to provide a greater benefit than pretraining on single-cell atlas data alone [2].
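The structure of such a linear model is a bilinear map between a gene embedding matrix G and a perturbation embedding matrix P. In the sketch below both matrices are random stand-ins (real ones would be extracted from scGPT/scFoundation and GEARS), and the two-step least-squares fit is one simple way to estimate the weights:

```python
import numpy as np

rng = np.random.default_rng(6)
n_genes, n_perts, dg, dp = 200, 30, 32, 16

# Random stand-ins for pretrained embeddings: G for a gene embedding
# matrix, P for a perturbation embedding matrix.
G = rng.normal(0.0, 1.0, (n_genes, dg))
P = rng.normal(0.0, 1.0, (n_perts, dp))

# Toy ground-truth response matrix generated by a hidden bilinear map.
W_true = rng.normal(0.0, 1.0, (dg, dp))
Y = G @ W_true @ P.T + rng.normal(0.0, 0.1, (n_genes, n_perts))

# Fit the bilinear weights W in Y ≈ G @ W @ P.T with two least-squares
# solves: first G @ A ≈ Y for A, then P @ W.T ≈ A.T for W.
A, *_ = np.linalg.lstsq(G, Y, rcond=None)
Wt, *_ = np.linalg.lstsq(P, A.T, rcond=None)
W = Wt.T

pred = G @ W @ P.T
rel_err = np.linalg.norm(pred - Y) / np.linalg.norm(Y)
print(f"relative reconstruction error: {rel_err:.3f}")
```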

Performance Benchmarking Data

Table 1: Zero-Shot Cell Embedding Performance vs. Baselines

This table summarizes the performance of scFM embeddings compared to established methods on common tasks, as measured in independent evaluations. ASW (Average Silhouette Width) and AvgBIO score measure how well cell types are separated, while Batch Score measures how well technical batch effects are removed [1].

| Task | Metric | HVG | Harmony | scVI | scGPT | Geneformer |
| --- | --- | --- | --- | --- | --- | --- |
| Cell Type Clustering | AvgBIO / ASW | Outperforms | Outperforms | Outperforms | Inconsistent | Underperforms |
| Batch Integration | Batch Score | Best | Good | Good | Moderate | Underperforms |

Key finding: a simple, established method often provides the most robust zero-shot performance for these tasks.

Table 2: Perturbation Prediction Performance

This table compares the performance of various models against simple baselines for predicting gene expression changes after genetic perturbations. L2 Distance measures the error in predicting expression values, while AUROC (Area Under the Receiver Operating Characteristic Curve) measures the ability to classify genetic interactions correctly [2] [10].

| Model / Method | L2 Distance (vs. Additive) | AUROC | Notes |
| --- | --- | --- | --- |
| Additive Model (Baseline) | Reference | N/A | Sums individual gene effects. A surprisingly strong benchmark [2]. |
| No Change Model | Higher | 0.63* | Predicts control expression. Foundation models can perform similarly [2]. |
| GEARS, scGPT, scFoundation | Higher | <0.63 | Do not consistently outperform additive/no-change baselines [2]. |
| Open-loop ISP (Geneformer) | N/A | 0.63 | PPV of 3%, similar to differential expression [10]. |
| Closed-loop ISP (Geneformer) | N/A | 0.86 | PPV of 9% (3x improvement) with only ~20 perturbation examples [10]. |

Key finding: deliberately simple models are not outperformed by current complex scFMs for this task.

Experimental Protocols for Key Benchmarks

Protocol 1: Benchmarking Zero-Shot Embeddings

Objective: Evaluate the quality of scFM cell embeddings for cell type clustering and batch integration without any fine-tuning.

Methodology:

  • Data Preparation: Obtain a labeled dataset with known cell types and batch information (e.g., Pancreas dataset with 5 batches [1]).
  • Generate Embeddings:
    • Process the dataset through the scFM (e.g., scGPT, Geneformer) to extract cell embeddings.
    • Process the same dataset through baseline methods: HVG selection, Harmony, and scVI.
  • Dimensionality Reduction & Visualization: Use UMAP or t-SNE to visualize all embeddings.
  • Quantitative Evaluation:
    • Cell Type Clustering: Calculate metrics like Average Silhouette Width (ASW) and AvgBIO score to quantify cell type separation.
    • Batch Integration: Calculate a batch integration score (e.g., using scib.metrics package) to quantify the removal of batch effects while preserving biological variance.
  • Analysis: Compare the scores of the scFM embeddings against the baselines. A well-performing embedding should score high on cell type separation and low on batch effect retention [1].
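For the quantitative-evaluation step, the silhouette computation can be sketched in plain NumPy. This is a toy version assuming Euclidean distances and small data; the protocol itself points to the scib.metrics package, which also covers the batch-integration scores this sketch does not.

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-cell silhouette width: (b - a) / max(a, b), where a is the mean
    distance to cells of the same type and b the mean distance to the
    nearest other type."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all cells.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        a = d[i, same].mean() if same.any() else 0.0
        b = min(d[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores[i] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return scores

# Two well-separated "cell types" should give an ASW close to 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
asw = silhouette_scores(X, labels).mean()
```

Averaging the per-cell scores gives the ASW used to compare scFM embeddings against the baselines.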

Protocol 2: Implementing a Closed-Loop ISP Framework

Objective: Dramatically improve the accuracy of in-silico perturbation predictions by incorporating experimental data.

Methodology:

  • Base Model Fine-tuning: Fine-tune a pre-trained scFM (e.g., Geneformer) on your single-cell RNA-seq dataset to distinguish relevant cell states (e.g., diseased vs. healthy) [10].
  • Open-loop ISP: Perform an initial round of in-silico perturbation predictions across the genome using the fine-tuned model.
  • Experimental Validation: Conduct a targeted Perturb-seq screen on a subset of genes (as few as 10-20) predicted in the previous step.
  • Closed-loop Fine-tuning: Further fine-tune the model using the scRNA-seq data from the Perturb-seq experiment. The labels should be the cell state (e.g., activated), not the gene perturbed.
  • Closed-loop ISP: Run a new round of ISP predictions with the refined model. Benchmarking shows this step can triple the Positive Predictive Value [10].
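The panel selection in the experimental-validation step is at its core a ranking problem. A minimal sketch (gene names and scores are synthetic, and ranking by a single open-loop ISP score is an assumption about how candidates would be prioritized):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical open-loop ISP output: one "shift toward target state"
# score per gene (names and scores are synthetic).
genes = np.array([f"GENE{i}" for i in range(100)])
isp_scores = rng.normal(size=100)

# Choose a small panel (~20 genes) for the targeted Perturb-seq screen.
k = 20
panel = genes[np.argsort(isp_scores)[::-1][:k]]
```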

Workflow: fine-tune base model on scRNA-seq data → run open-loop ISP (PPV ~3%) → validate with targeted Perturb-seq (~20 genes) → fine-tune with Perturb-seq data (closed-loop) → run closed-loop ISP (PPV ~9%).

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents for Perturbation Validation Experiments

This table lists key reagents and tools required for experimental validation of computational predictions, a critical step in the closed-loop framework.

| Item | Function / Description | Example Use Case |
|---|---|---|
| CRISPRa/i System | A system for gene activation (a) or interference (i) to perturb gene function. | Genetically perturbing target genes in primary human T cells to study activation [10]. |
| Perturb-seq Protocol | A single-cell RNA sequencing method that captures the transcriptomic effects of genetic perturbations in pooled screens. | Generating experimental data for fine-tuning foundation models in a closed loop [10]. |
| ATAC-seq Kit | Assay for Transposase-Accessible Chromatin to map genome-wide chromatin accessibility. | Providing complementary epigenetic data to understand regulatory mechanisms [11]. |
| ChIPmentation Kit | A technology that combines chromatin immunoprecipitation (ChIP) with tagmentation for efficient library prep. | Mapping histone modifications or transcription factor binding sites in low-input samples [12]. |
| Flow Cytometry Assays | Measures protein expression and cytokine production (e.g., IL-2, IFN-γ) at the single-cell level. | Providing orthogonal, non-transcriptomic validation of perturbation effects on cell function [10]. |

Frequently Asked Questions (FAQs)

Q1: What does "zero-shot" evaluation mean for single-cell foundation models (scFMs), and why is it critical for my research?

A1: Zero-shot evaluation tests a foundation model's performance on a new task or dataset without any additional training (fine-tuning). This is critical for exploratory biology because, in many discovery settings, the biological labels or outcomes you are looking for are unknown, making fine-tuning impossible. A model's zero-shot capability demonstrates its true generalizability and the fundamental biological understanding it gained during pre-training [1].

Q2: My zero-shot scFM embeddings are performing poorly in cell type clustering. What could be the cause?

A2: Recent benchmarks have identified that scFMs like scGPT and Geneformer can underperform simpler methods like Highly Variable Gene (HVG) selection or established integration tools like Harmony and scVI in zero-shot cell type clustering [1]. This suggests that the masked language model pre-training objective used by many scFMs may not automatically produce high-quality cell embeddings for every downstream task without fine-tuning. If you encounter this, consider using a simpler baseline method as a benchmark for your specific dataset [1].

Q3: Can current scFMs accurately predict genetic interaction effects in a zero-shot setting?

A3: Current evidence suggests they cannot. When predicting effects of double genetic perturbations, foundation models and other deep learning models have failed to outperform a deliberately simple "additive" baseline, which just sums the effects of single perturbations. Furthermore, these models struggle to correctly predict synergistic genetic interactions, often defaulting to predicting no interaction [2].
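The additive baseline is simple enough to state in a few lines. A minimal sketch over toy mean-expression profiles (the study computes the sum in log-fold-change space; applied to log-scale profiles the arithmetic below is the same):

```python
import numpy as np

def additive_baseline(ctrl, pert_a, pert_b):
    """Predict a double perturbation by summing the two single-perturbation
    shifts relative to control -- the 'additive' baseline from [2]."""
    return ctrl + (pert_a - ctrl) + (pert_b - ctrl)

# Toy mean (log-scale) expression profiles over 4 genes.
ctrl = np.array([1.0, 2.0, 3.0, 4.0])
a = np.array([1.5, 2.0, 3.0, 4.0])   # perturbing A shifts gene 0
b = np.array([1.0, 2.0, 2.0, 4.0])   # perturbing B shifts gene 2
pred_ab = additive_baseline(ctrl, a, b)
```

By construction this baseline can never predict a synergistic interaction, which makes it a useful control: a model that fails to beat it has not learned anything beyond independent single-gene effects.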

Q4: Are there any scFMs that show promising zero-shot capabilities?

A4: Some newer models are being designed with a stronger focus on zero-shot performance. For example, scShift is a deep identifiable model that, when scaled up, has demonstrated remarkable zero-shot capabilities in characterizing cell types and biological states while overcoming batch effects across datasets [13]. This indicates that model architecture and training objectives are key factors for successful zero-shot application.

Troubleshooting Guides

Problem 1: Poor Batch Integration in Cell Embeddings

Symptoms: When visualizing your scFM cell embeddings, the data clusters strongly by batch or dataset source instead of by biological cell type.

Diagnosis: The model has failed to learn batch-invariant representations of cells in its zero-shot setting.

Solutions:

  • Benchmark against baselines: Compare your scFM's performance against a simple HVG selection and established batch integration tools like scVI or Harmony. One study found that HVG selection achieved the best batch integration scores across several datasets [1].
  • Check pre-training data: Investigate if your scFM was pre-trained on data similar to your query dataset. Performance can be variable even on previously seen data, but understanding the pre-training corpus can help diagnose issues [1].

Problem 2: Inaccurate Prediction of Genetic Perturbation Effects

Symptoms: Your model's predictions for gene expression changes after a perturbation are inaccurate, particularly for double-gene perturbations, and are worse than a simple additive model.

Diagnosis: The model has not learned the underlying biological rules that govern genetic interactions.

Solutions:

  • Use a linear baseline: Before deploying a complex scFM, test its performance against a simple linear model or an "additive" model. If the simple model is superior, it indicates the scFM has not achieved its claimed goal of learning generalizable biological principles [2].
  • Inspect gene embeddings: If the model allows it, extract its internal gene embeddings. Research has shown that using these embeddings in a simple linear predictor can sometimes perform as well as the model's own complex decoder, suggesting the embeddings themselves may contain useful, though under-utilized, information [2].

Quantitative Performance Data

Table 1: Zero-shot Performance Comparison on Cell Type Clustering (AvgBIO Score) [1]

| Model / Method | Pancreas Dataset | PBMC (12k) Dataset | Tabula Sapiens Dataset | Immune Dataset |
|---|---|---|---|---|
| HVG (Baseline) | 0.671 | 0.620 | 0.672 | 0.625 |
| scVI (Baseline) | 0.659 | 0.621 | 0.653 | 0.581 |
| Harmony (Baseline) | 0.622 | 0.615 | 0.579 | 0.549 |
| scGPT | 0.581 | 0.649 | 0.651 | 0.545 |
| Geneformer | 0.551 | 0.556 | 0.502 | 0.508 |

A higher AvgBIO score indicates better cell type separation. HVG often outperforms or matches foundation models.

Table 2: Performance on Double Genetic Perturbation Prediction (L2 Distance) [2]

| Model / Method | Prediction Error (L2 Distance) | Outperforms Additive Baseline? |
|---|---|---|
| Additive Baseline | ~0.28 | N/A |
| No Change Baseline | ~0.40 | No |
| GEARS | ~0.38 | No |
| scGPT | ~0.42 | No |
| Geneformer* | ~0.45 | No |
| scBERT* | ~0.43 | No |

Models marked with * were repurposed with a linear decoder. Lower L2 distance is better. No model outperformed the simple additive baseline.

Experimental Protocols

Protocol 1: Benchmarking Zero-Shot Cell Type Clustering

Objective: To evaluate the quality of cell embeddings generated by an scFM for separating known cell types without fine-tuning.

Materials: A labeled single-cell dataset with known cell types (e.g., a subset of Tabula Sapiens). The pre-trained scFM model (e.g., scGPT, Geneformer). Baseline methods (HVG, scVI, Harmony).

Methodology:

  • Generate Embeddings: Input your dataset into the scFM and extract the cell embeddings without performing any fine-tuning.
  • Run Baselines: Generate cell embeddings or reduced-dimensionality representations using your chosen baseline methods (e.g., select HVGs and perform PCA).
  • Cluster and Score: Use a clustering algorithm (e.g., Leiden, K-means) on all embeddings. Calculate clustering metrics like Average BIO (AvgBIO) score or Average Silhouette Width (ASW) to quantify how well the clusters correspond to the known cell types.
  • Compare: Compare the scores achieved by the scFM against the baseline methods [1].

Protocol 2: Evaluating Perturbation Effect Prediction

Objective: To test an scFM's ability to predict transcriptome-wide gene expression changes caused by single or double genetic perturbations.

Materials: A perturbation dataset (e.g., Norman et al. or Replogle et al. data). The scFM (e.g., scGPT, scFoundation). Baselines (Additive model, No-change model, simple linear model).

Methodology:

  • Data Splitting: For double perturbation prediction, fine-tune the model on all single perturbations and a portion of the double perturbations. Hold out the remaining double perturbations for testing.
  • Generate Predictions: Use the model to predict the gene expression values for the held-out test perturbations.
  • Calculate Error: For each prediction, compute the L2 distance between the predicted and observed expression values for the top 1,000 highly expressed genes.
  • Compare to Baselines: Compare the model's average prediction error against the errors from the simple additive and no-change baselines [2].
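The error metric in the calculate-error step can be sketched as follows. Ranking the "top 1,000 highly expressed genes" by mean control expression is an assumption here; the benchmark's exact gene ranking may differ.

```python
import numpy as np

def l2_top_genes(pred, obs, ctrl, k=1000):
    """L2 distance between predicted and observed expression, restricted
    to the k most highly expressed genes (ranked by control expression)."""
    top = np.argsort(ctrl)[::-1][:k]
    return float(np.linalg.norm(pred[top] - obs[top]))

rng = np.random.default_rng(1)
ctrl = rng.gamma(2.0, 1.0, size=5000)            # synthetic control profile
obs = ctrl + rng.normal(0.0, 0.1, size=5000)     # synthetic perturbed profile

err_no_change = l2_top_genes(ctrl, obs, ctrl)    # "no change" baseline error
err_perfect = l2_top_genes(obs, obs, ctrl)       # oracle prediction: zero error
```

Computing this for the model, the additive baseline, and the no-change baseline on the same held-out perturbations gives directly comparable numbers.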

Visualizations

Diagram 1: Workflow for Zero-Shot Benchmarking of scFMs

Pre-trained scFM → input new dataset (no labels) → generate zero-shot cell embeddings → extract embeddings for evaluation → perform downstream task (e.g., clustering) → quantify performance using metrics (e.g., AvgBIO, ASW) → compare against simple baselines (HVG, scVI) → conclusion: assess model generalizability.

Diagram 2: Simple Baselines Outperform scFMs in Perturbation Prediction

A double gene perturbation is passed both to a foundation model (scGPT, Geneformer, etc.) and to the simple additive baseline (sum of single-perturbation effects). Each produces a predicted expression profile that is compared against the ground-truth expression: the foundation model's prediction shows high error, while the additive baseline's shows lower error.

The Scientist's Toolkit: Key Research Reagents & Solutions

Table 3: Essential Resources for Zero-Shot scFM Evaluation

| Item | Function in Evaluation |
|---|---|
| Pre-trained Model Weights (e.g., for scGPT, Geneformer) | Provides the foundational model to be tested in a zero-shot context without further training [1] [2]. |
| Benchmarking Datasets (e.g., Norman et al. perturbation data, Tabula Sapiens) | Serves as the standardized, ground-truthed test bed for evaluating model performance on specific tasks like perturbation prediction or cell type identification [1] [2]. |
| Simple Baseline Models (e.g., Additive model, HVG selection, linear models) | Critical controls to determine if the complexity of an scFM provides any tangible benefit over simple, established methods [1] [2]. |
| Quantitative Metrics (e.g., L2 distance, AvgBIO score, ASW) | Provides objective, numerical measures of model performance for tasks like prediction accuracy and cluster quality, enabling direct comparison between models [1] [2]. |
| Integration Tools (e.g., Harmony, scVI) | Established methods for comparison against scFMs for tasks like batch correction and dimensionality reduction [1]. |

Beyond Zero-Shot: Methodological Advances for Practical Application

Frequently Asked Questions (FAQs)

Q1: What is Parameter-Efficient Fine-Tuning (PEFT) and why is it important for large language models (LLMs)?

PEFT refers to a set of techniques that adapt a large pre-trained model to a new task by training only a small number of parameters, rather than the entire model. This is crucial because LLMs can have billions of parameters, making full fine-tuning computationally expensive, time-consuming, and prone to overfitting, especially on smaller datasets. PEFT methods, such as adapters, achieve performance comparable to full fine-tuning while dramatically reducing computational costs and storage requirements [14] [15].

Q2: How do adapter layers work, and where are they inserted in a transformer model?

Adapters are small, bottleneck-shaped neural network modules inserted into the layers of a pre-trained transformer model. A typical adapter consists of two fully connected layers with a non-linear activation in between. The first layer projects the input down to a lower dimension (the bottleneck), and the second layer projects it back up to the original input dimension [14] [15]. In the original adapter method proposed by Houlsby et al. (2019), two adapter layers are inserted into each transformer block: one after the multi-head attention module and one after the feed-forward network [14].
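A bottleneck adapter is small enough to write out directly. The NumPy sketch below is illustrative, not the Hugging Face implementation; the zero-initialized up-projection, which makes the adapter an exact identity map before training, is a common but assumed design choice.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class Adapter:
    """Bottleneck adapter: down-projection, non-linearity, up-projection,
    plus the residual connection described above."""
    def __init__(self, hidden_dim, bottleneck_dim, rng):
        self.W_down = rng.normal(0.0, 0.02, (hidden_dim, bottleneck_dim))
        self.b_down = np.zeros(bottleneck_dim)
        # Zero-initialized up-projection: the adapter starts as an identity map.
        self.W_up = np.zeros((bottleneck_dim, hidden_dim))
        self.b_up = np.zeros(hidden_dim)

    def __call__(self, h):
        z = gelu(h @ self.W_down + self.b_down)
        return h + z @ self.W_up + self.b_up  # residual connection

rng = np.random.default_rng(0)
adapter = Adapter(hidden_dim=768, bottleneck_dim=32, rng=rng)
h = rng.normal(size=(4, 768))   # a batch of token representations
out = adapter(h)                # identical to h before any training
```

Starting as an identity means inserting the adapters does not perturb the frozen pre-trained model's behavior at step zero of fine-tuning.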

Q3: What are the primary advantages of using adapters over full fine-tuning?

  • Reduced Computational Cost: Requires fewer GPUs and less training time [15].
  • Lower Hardware Barrier: Enables fine-tuning of large models on consumer-grade GPUs with less VRAM [15].
  • Modularity and Storage Efficiency: The large base model remains frozen and can be shared across multiple tasks. Only the small adapter weights need to be saved for each new task, reducing storage overhead [14] [15].
  • Mitigated Catastrophic Forgetting: By keeping the original model parameters frozen, adapters help prevent the model from forgetting its general knowledge acquired during pre-training [15].

Q4: In the context of single-cell biology, what is a key limitation of foundation models that PEFT could help address?

Recent rigorous benchmarks have revealed that single-cell foundation models (scFMs), like scGPT and Geneformer, often fail to outperform simple baseline models in zero-shot settings for predicting genetic perturbation effects [2] [1] [5]. This means that using these models "out-of-the-box" without any further training yields unreliable results. PEFT, through methods like adapter tuning, provides a pathway to specialize these general scFMs on specific, high-quality perturbation datasets, potentially bridging this performance gap without the cost of full fine-tuning.

Q5: How does the parameter efficiency of adapters compare to simply fine-tuning the top layers of a model?

Adapters can achieve superior performance with a comparable or even smaller number of trained parameters. For example, a BERT model trained with adapters matched the performance of a fully fine-tuned model while only training 3.6% of the parameters. In a direct experiment with a DistilBERT model, fine-tuning adapter layers outperformed fine-tuning only the top two layers on a sentiment classification task, despite using a similar number of parameters (599,424 for adapters vs. 592,130 for the top layers) [14].
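The 599,424 figure can be reproduced from DistilBERT's dimensions under one plausible accounting (hidden size 768, bottleneck 32, two adapters per block, six blocks; whether a classification head was included in the count is not stated):

```python
# Parameter accounting for the quoted DistilBERT adapter setup.
hidden, bottleneck = 768, 32
adapters_per_block, blocks = 2, 6

down = hidden * bottleneck + bottleneck   # down-projection weights + bias
up = bottleneck * hidden + hidden         # up-projection weights + bias
per_adapter = down + up                   # 49,952 parameters per adapter

total_adapter_params = per_adapter * adapters_per_block * blocks  # 599,424
fraction = total_adapter_params / 66_900_000  # vs. ~66.9M full model, <1%
```

Note that this sub-1% trainable fraction for DistilBERT is distinct from the 3.6% figure quoted for BERT, which is a larger model with a different adapter configuration.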

Troubleshooting Guides

Issue 1: Poor Task Performance After Adapter Tuning

Problem: Your model, after adapter tuning, is not achieving the expected performance on the downstream task.

Potential Causes and Solutions:

  • Cause: Inadequate Bottleneck Dimension
    • Solution: The bottleneck dimension of the adapter is a key hyperparameter. A dimension that is too small may not provide enough capacity for the task, while one that is too large defeats the purpose of efficiency. Experiment with different sizes (e.g., 8, 16, 32, 64) [14].
  • Cause: Data Quality or Mismatch
    • Solution: This is particularly critical when fine-tuning scFMs for perturbation prediction, as benchmarks show they struggle with strong or atypical perturbations [2]. Ensure your fine-tuning dataset is of high quality and covers the specific biological context or perturbation types you are targeting.
  • Cause: Suboptimal Training Hyperparameters
    • Solution: Even though fewer parameters are being trained, the learning rate and number of epochs still need to be tuned. A learning rate that is too high can cause instability, while one that is too low can lead to slow or insufficient convergence.

Issue 2: Unexpectedly High Memory Usage During Training

Problem: Your GPU memory usage is still high even though you are using adapters.

Potential Causes and Solutions:

  • Cause: Large Batch Size
    • Solution: Reduce the batch size. While adapters reduce the number of trainable parameters, the activations from the large base model still need to be stored in memory during the forward and backward passes. A smaller batch size reduces memory pressure from activations [15].
  • Cause: Large Base Model
    • Solution: Combine adapters with model quantization techniques, such as QLoRA. This involves loading the base model in a lower-bit precision (e.g., 4-bit) while training the adapters in higher precision (e.g., 16-bit), offering further significant memory savings [15].

Experimental Protocols & Data

Protocol: Implementing and Tuning Adapters in a Transformer

This protocol outlines the steps to insert and train adapter layers in a transformer-based model like DistilBERT, based on the experiments by Sebastian Raschka [14].

Materials:

  • Pre-trained Model: A pre-trained transformer model (e.g., distilbert-base-uncased from Hugging Face).
  • Dataset: A labeled dataset for a downstream task (e.g., the movie review dataset for sentiment classification).
  • Software: PyTorch or TensorFlow, and the Transformers library.
  • Hardware: A GPU with sufficient VRAM (e.g., a 16GB T4 GPU can be sufficient for models like DistilBERT).

Methodology:

  • Model Loading: Load the pre-trained model and its tokenizer.
  • Adapter Module Definition: Define a function to create an adapter module. This is typically a sequential module with a down-projection, a non-linearity (e.g., GELU), and an up-projection.

  • Adapter Insertion: Iterate through the transformer blocks of the model and insert the adapter layers at the desired locations (e.g., after the attention output and after the feed-forward network).

  • Freeze Base Model: Freeze all parameters of the original pre-trained model.

  • Training Loop: Train the model on the downstream task. Only the parameters in the adapter layers will be updated.

Protocol: Benchmarking Fine-Tuning Methods

This protocol describes how to compare different fine-tuning strategies, as performed in [14].

Methodology:

  • Establish Baselines:
    • Full Fine-Tuning: Train all parameters of the base model.
    • Top-Layer Tuning: Freeze all but the last few classification layers of the model and train only those.
  • Test PEFT Methods:
    • Adapter Tuning: Insert and train only the adapter layers, as described above.
    • Other PEFT Methods: Compare against other techniques like LoRA or prompt tuning [15].
  • Evaluation: Use a fixed validation or test set to evaluate all methods on the same metrics (e.g., accuracy, F1-score).
  • Metrics Comparison: Record and compare the performance, number of trainable parameters, and training time for each method.

Quantitative Performance Comparison of Fine-Tuning Methods

The table below summarizes results from a sentiment classification task using a DistilBERT model, comparing different fine-tuning strategies [14].

Table 1: Comparison of Fine-Tuning Methods on DistilBERT

| Fine-Tuning Method | Trainable Parameters | Test Accuracy | Training Time (min) |
|---|---|---|---|
| Top Layers Only | 592,130 | 86.4% | 2.89 |
| Adapters (Bottleneck=32) | 599,424 | 88.4% | 5.69 |
| Full Fine-Tuning | ~66.9 Million | 93.0% | 7.12 |

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Components for Adapter-based Fine-Tuning Experiments

| Item | Function | Example / Note |
|---|---|---|
| Pre-trained LLM | The foundation model that provides general language or biological knowledge. | Models like DistilBERT, LLaMA, or single-cell models like scGPT [14] [1]. |
| Task-Specific Dataset | The labeled data used to adapt the model to a new domain or task. | For scFMs, this would be a high-quality dataset of genetic perturbations [2]. |
| Adapter Modules | The small, trainable networks inserted into the base model. | Bottleneck architecture with a configurable hidden dimension [14]. |
| Deep Learning Framework | Software used to implement and train the model. | PyTorch or TensorFlow with the Hugging Face Transformers library. |
| GPU Acceleration | Hardware to handle the computational load of training and inference. | Consumer GPUs (e.g., 16GB T4) are often sufficient for adapter tuning [15]. |

Workflow and Architecture Diagrams

Adapter Architecture in a Transformer Block

Transformer block with adapters: Input → Multi-Head Attention → Add & Norm → Adapter (Down-Proj → GELU → Up-Proj) → Add & Norm → Feed-Forward Network → Adapter (Down-Proj → GELU → Up-Proj) → Output.

Adapter Tuning Experimental Workflow

1. Load pre-trained model → 2. Insert adapter layers → 3. Freeze base model parameters → 4. Train only adapter parameters → 5. Evaluate on downstream task.

Single-Cell FM Benchmarking Context

Limitation: zero-shot scFMs underperform on perturbation prediction. Causes: they struggle with strong or atypical effects, their performance drops under distribution shift, and they fail to consistently outperform simple baselines. Proposed solution: adapter-based fine-tuning, which specializes general models on high-quality data and achieves performance gains with parameter efficiency.

Incorporating External Biological Knowledge to Enhance Predictions

Single-cell foundation models (scFMs), such as Geneformer and scGPT, are pre-trained on massive single-cell transcriptomics datasets with the goal of learning universal biological patterns. A primary application is in silico perturbation (ISP) prediction, where these models forecast how a cell's transcriptome changes in response to a genetic intervention. In discovery research where labels are unknown, models must often operate in a zero-shot setting without task-specific fine-tuning.

Recent rigorous benchmarking, however, has revealed a significant limitation: the zero-shot performance of these scFMs for perturbation prediction frequently fails to outperform deliberately simple baselines [1] [2] [6]. This technical support guide addresses this performance gap by providing actionable strategies for incorporating external biological knowledge to enhance prediction reliability.

FAQs & Troubleshooting Guides

Q1: Why do my model's zero-shot perturbation predictions underperform simple baselines?

This is a commonly reported issue. Quantitative benchmarks have demonstrated that even state-of-the-art scFMs do not consistently outperform simple models like an "additive" baseline (summing individual logarithmic fold changes) or predicting no change from the control condition [2].

  • Root Cause Analysis:

    • Trivial Mechanisms: The masked language model pre-training framework may not inherently produce cell embeddings that are optimal for perturbation tasks. Models can sometimes rely on superficial patterns rather than learning causal biological relationships [1].
    • Distribution Shift: Models struggle significantly when making predictions on data that differs from their pre-training distribution [6].
    • Knowledge Gap: Pre-training on general transcriptome data may not capture the specific dynamics of gene regulatory networks under perturbation.
  • Solution: Implement a "Closed-Loop" Fine-Tuning Framework.

    • Description: Move beyond "open-loop" ISP by incorporating a limited set of experimental perturbation data into the model's training regimen. This uses experimental results as feedback to guide the model toward more accurate predictions [10].
    • Protocol:
      • Start with a pre-trained scFM (e.g., Geneformer).
      • Fine-tune the model on a composite dataset that includes:
        • Standard single-cell RNA sequencing (scRNA-seq) data from your biological context (e.g., resting and activated T-cells).
        • scRNA-seq data from Perturb-seq (or similar) experiments, even if from a limited number of gene perturbations.
      • Use this refined model to perform ISP on a broader set of genes [10].
  • Evidence: In a T-cell activation study, this approach increased the positive predictive value (PPV) of ISP three-fold, from 3% to 9%, while also improving sensitivity and specificity [10].
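All four reported metrics derive from a 2×2 confusion matrix; a small helper makes the definitions explicit. The counts below are hypothetical, chosen only to reproduce a 9% PPV, and are not the study's actual confusion matrix.

```python
def screen_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix statistics for a perturbation screen."""
    return {
        "PPV": tp / (tp + fp),          # positive predictive value
        "NPV": tn / (tn + fn),          # negative predictive value
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts: 100 predicted positives of which 9 are true hits.
m = screen_metrics(tp=9, fp=91, tn=400, fn=3)
```

Even a tripled PPV of 9% implies roughly ten wet-lab validations per true hit, which is why PPV should be read alongside NPV, sensitivity, and specificity.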

Q2: How can I improve predictions for a rare disease with limited patient data?

This scenario is challenging due to sample scarcity, but external knowledge can be leveraged.

  • Root Cause: scFMs require sufficient contextual data to make meaningful predictions. For rare diseases, the model may not have encountered enough relevant patterns during pre-training.

  • Solution: Utilize Engineered Cell Models and Cross-Validation.

    • Description:
      • Create an in vitro model of the disease by engineering relevant mutations (e.g., RUNX1 loss-of-function for a platelet disorder) in human stem cells.
      • Validate that the engineered cells closely mirror the transcriptomic profile of actual patient-derived cells.
      • Fine-tune the scFM to distinguish between the engineered disease-model cells and isogenic control cells.
      • Perform ISP to identify genes whose perturbation shifts the disease state toward a healthy state [10].
  • Example Workflow for a Rare Hematologic Disorder:

    • Model Creation: Engineer RUNX1-knockout in human hematopoietic stem cells (HSCs).
    • Validation: Confirm high concordance with patient HSC transcriptomes (e.g., reduced expression of known RUNX1 targets).
    • Fine-tuning: Specialize Geneformer to classify RUNX1-knockout vs. control HSCs.
    • Prediction & Prioritization: Run ISP and cross-reference results with differential expression analysis to select high-confidence therapeutic targets for experimental validation [10].

Q3: My model fails to predict strong or atypical perturbation effects. What can I do?

This is a known weakness of current scFMs, as they tend to be biased towards predicting minimal changes [6].

  • Root Cause: The models may be averaging over possible outcomes or are not trained on sufficient examples of strong genetic interactions.

  • Solution: Integrate Lineage-Specific Gene Embeddings and Prioritize Data Quality.

    • Description:
      • Leverage External Embeddings: Instead of relying solely on the model's internal representations, use gene embeddings pre-trained on large-scale perturbation datasets (e.g., from another cell line). These embeddings can be integrated into a simpler linear model that has demonstrated competitive performance [2].
      • Curate Training Data: Rigorously quality-control your perturbation datasets. Exclude perturbations that do not affect their target gene's expression, as these may represent failed experiments that confound the model [2].
      • Focus on Data Breadth: The key to predicting atypical effects may lie not in more complex models, but in higher-quality and more diverse datasets that capture a wider range of cellular states and strong perturbation effects [6].

Q4: How can I biologically validate my scFM's internal representations?

It is crucial to verify that the model's latent space captures biologically meaningful relationships.

  • Root Cause: Without validation, it's unclear if the model has learned relevant biology or just technical artifacts.

  • Solution: Use Ontology-Informed Metrics.

    • Description: Move beyond standard clustering metrics. Implement novel metrics that evaluate the biological plausibility of the model's outputs.
    • Recommended Metrics:
      • scGraph-OntoRWR: Measures the consistency between cell-type relationships in the embedding space and established biological knowledge from cell ontologies.
      • Lowest Common Ancestor Distance (LCAD): For cell type annotation tasks, this metric assesses the severity of misclassification by measuring the ontological proximity between the predicted and true cell type. A smaller LCAD indicates a less severe error (e.g., confusing two T-cell subtypes vs. confusing a T-cell with a neuron) [4].
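LCAD can be illustrated on a toy ontology. The hand-built child→parent map below is a stand-in for the Cell Ontology, and counting unweighted edges to the lowest common ancestor is an assumption; the metric in [4] may be computed differently.

```python
def lca_distance(parent, a, b):
    """Edges from a and b to their lowest common ancestor in a tree given
    as a child -> parent dict (a toy stand-in for a cell ontology)."""
    depth, node, ancestors = 0, a, {}
    while node is not None:              # record a's ancestors with depths
        ancestors[node] = depth
        depth, node = depth + 1, parent.get(node)
    depth, node = 0, b
    while node not in ancestors:         # climb from b until the paths meet
        depth, node = depth + 1, parent.get(node)
    return ancestors[node] + depth

# Toy ontology: two T-cell subtypes are close; a neuron is far away.
parent = {
    "CD4 T": "T cell", "CD8 T": "T cell",
    "T cell": "lymphocyte", "lymphocyte": "cell",
    "neuron": "cell",
}
near = lca_distance(parent, "CD4 T", "CD8 T")   # paths meet at "T cell"
far = lca_distance(parent, "CD4 T", "neuron")   # paths meet only at "cell"
```

Misclassifying a CD4 T cell as a CD8 T cell yields a small distance, while calling it a neuron yields a large one, matching the intuition that not all annotation errors are equally severe.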

Quantitative Performance Data

Table 1: Benchmarking scFMs against simple baselines for double perturbation prediction. Prediction error is measured as L2 distance on top 1,000 genes (lower is better). Adapted from [2].

| Model / Baseline | Prediction Error (L2) | Outperforms Additive Baseline? |
|---|---|---|
| Additive Model (Simple Baseline) | ~1.5 | (Baseline) |
| No Change Model (Simple Baseline) | ~4.5 | No |
| scGPT | ~4.5 | No |
| Geneformer* | ~4.2 | No |
| scBERT* | ~4.5 | No |
| UCE* | ~4.5 | No |
| GEARS | ~3.8 | No |
| scFoundation | ~3.2 | No |

Note: Models marked with * were repurposed with a linear decoder for this task.

Table 2: Impact of "closed-loop" fine-tuning on perturbation prediction accuracy for T-cell activation. PPV: Positive Predictive Value; NPV: Negative Predictive Value. Data from [10].

| Fine-Tuning Approach | PPV | NPV | Sensitivity | Specificity |
|---|---|---|---|---|
| Open-Loop (Standard) ISP | 3% | 98% | 48% | 60% |
| Differential Expression | 3% | 78% | 40% | 50% |
| Closed-Loop ISP | 9% | 99% | 76% | 81% |

Experimental Protocols

Protocol 1: Closed-Loop Fine-Tuning for Enhanced ISP

This methodology details how to incorporate experimental perturbation data to improve a pre-trained scFM [10].

  • Base Model Selection: Begin with a pre-trained foundation model (e.g., Geneformer-30M-12L).
  • Task Specialization Fine-tuning: Fine-tune the model on your target biological state (e.g., classify resting vs. activated T-cells) using relevant scRNA-seq data. This creates a specialized "open-loop" model.
  • Incorporate Perturbation Data:
    • Gather scRNA-seq data from a perturbation screen (e.g., Perturb-seq) in the same biological context. The data should be labeled with the resulting cellular state (e.g., activated/rested), but not necessarily with the perturbed gene's identity.
    • Combine this perturbation dataset with the initial state-defining data.
  • Closed-Loop Fine-tuning: Further fine-tune the "open-loop" model on this combined dataset.
  • ISP Execution: Use the resulting "closed-loop" model to perform in silico perturbations on genes not included in the perturbation training data.

Pre-trained scFM (e.g., Geneformer) → fine-tune on state data (e.g., resting vs. activated T-cells) → open-loop model → incorporate perturbation data (Perturb-seq with state labels) → closed-loop fine-tuning → closed-loop model → perform ISP on new genes.

Closed-loop fine-tuning workflow
Protocol 2: Target Prioritization for a Rare Disease

This protocol outlines a strategy to overcome data scarcity in rare disease research [10].

  • Generate Disease Model: Engineer a loss-of-function mutation in the gene of interest (e.g., RUNX1) in appropriate human primary cells (e.g., HSCs).
  • Transcriptomic Validation: Sequence the engineered cells and validate them by confirming that known downstream pathways of the target gene are differentially expressed, matching patterns observed in scarce patient data.
  • Model Fine-tuning: Fine-tune the scFM to distinguish between the disease-model cells and isogenic control cells.
  • Cross-Method Prediction:
    • Perform ISP using the fine-tuned model to get a list of candidate genes.
    • Independently, perform a standard differential expression (DE) analysis between disease and control models.
  • Target Prioritization: Select genes that are identified as significant by both ISP and DE analysis. This consensus approach increases confidence in the predictions.
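The final intersection step reduces to a set operation. The gene lists below are hypothetical placeholders, not results from [10]:

```python
# Hypothetical hit lists; in practice these come from ISP ranking
# and differential expression testing, respectively.
isp_hits = {"MTOR", "CD74", "MIF", "KIT", "GATA2"}
de_hits = {"MTOR", "CD74", "MIF", "RUNX1T1", "SPI1"}

# Consensus targets: significant in both analyses.
high_confidence = sorted(isp_hits & de_hits)
print(high_confidence)  # ['CD74', 'MIF', 'MTOR']
```

Requiring agreement between the model-based ISP ranking and the model-free DE analysis filters out predictions that reflect scFM artifacts rather than biology.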

[Diagram: Engineered disease model (RUNX1-KO HSCs) → scRNA-seq and validation vs. patient data → fine-tune scFM to classify disease vs. control → in silico perturbation (ISP) and, in parallel, differential expression (DE) → intersect predictions → high-confidence targets]

Rare disease target prioritization

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential resources for enhancing scFM perturbation predictions.

Reagent / Resource Function in Context Example Use Case
Geneformer-30M-12L A pre-trained scFM based on the Transformer architecture. Can be fine-tuned for specific tasks. Base model for closed-loop fine-tuning in T-cell activation and rare disease modeling [10].
Perturb-seq Data Single-cell RNA sequencing data from genetic perturbation screens. Provides ground-truth data on transcriptional outcomes. Incorporated during fine-tuning to teach the model the causal links between gene perturbation and cell state [10].
Engineered Cell Models In vitro models of disease created via CRISPR/Cas9 editing. Bypasses the need for large numbers of patient samples. Used to generate abundant, relevant transcriptomic data for rare diseases like RUNX1-FPD [10].
Cell Ontologies Structured, controlled vocabularies for cell types. Define the hierarchical relationships between different cell classes. Used to compute biology-aware validation metrics like scGraph-OntoRWR and LCAD [4].
Linear Model with Embeddings A simple, interpretable baseline model that uses pre-trained gene/perturbation vectors. Serves as a strong benchmark; can outperform complex scFMs in predicting unseen perturbations [2].
RUNX1-FPD Model A specific engineered model for RUNX1-familial platelet disorder using human HSCs. Used to identify therapeutic targets (e.g., mTOR, CD74-MIF axis) via ISP [10].

Frequently Asked Questions

Q1: My zero-shot perturbation predictions are outperformed by a simple "no change" baseline. What could be wrong?

A: This is a known limitation identified in recent benchmarks [2]. The "no change" baseline, which always predicts expression identical to the control condition, and the "additive" baseline, which sums the individual logarithmic fold changes, have both been found to be highly competitive. Current foundation models often struggle to learn representations that generalize better than these simplistic assumptions for predicting unseen perturbation effects [2] [6].

Q2: How can I improve my model's prediction of genetic interactions from double perturbations?

A: Benchmarks reveal that models frequently misclassify interaction types, often predicting "buffering" interactions but rarely correctly identifying "synergistic" or "opposite" interactions [2]. If your model shows this behavior, it may not be capturing the underlying biological complexity. Consider enriching your training data with confirmed interaction examples or exploring alternative model architectures that move beyond current foundation model limitations.

Q3: Can pretrained gene embeddings from large models enhance prediction for unseen single-gene perturbations?

A: Evidence suggests limited benefits. A linear model using embeddings from scFoundation or scGPT did not consistently outperform a linear model with embeddings derived directly from the training data [2]. The most effective strategy identified was pretraining the perturbation embedding matrix (P) on existing large-scale perturbation data (e.g., from a different cell line), which provided more predictive power than atlas-scale single-cell pretraining [2].

Q4: My model works well on the training data but fails to generalize. What steps should I take?

A: This indicates poor out-of-distribution performance, a common challenge. First, implement the simple linear baseline and the "mean prediction" baseline to quantify the performance gap [2]. Ensure your training data encompasses a wide range of perturbation strengths and types, as models tend to struggle with strong or atypical effects [6]. Also, verify that your dataset does not suffer from the biases often present in public drug combination databases [16].

Troubleshooting Guides

Issue: Poor Zero-Shot Performance on Unseen Perturbations

Problem: Model fails to accurately predict transcriptome changes for genetic perturbations or drug combinations not seen during training.

Investigation & Diagnosis

  • Benchmark Against Baselines: Compare your model's performance against these simple baselines [2]:
    • No Change Baseline: Always predicts the control condition expression.
    • Additive Baseline: For double perturbations, predicts the sum of the individual logarithmic fold changes.
    • Mean Prediction Baseline: For unseen single perturbations, predicts the average expression across the training set perturbations.
  • Check Embedding Utility: If using pretrained gene or drug embeddings, test their effectiveness by plugging them into a simple linear model (see Experimental Protocol 2). Their performance may be limited [2].
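The three baselines above take only a few lines of NumPy. This sketch uses synthetic stand-in arrays; in practice these would be log-transformed expression profiles from your dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 1000
control = rng.normal(5.0, 1.0, n_genes)                # log-expression under control
lfc_a = rng.normal(0.0, 0.3, n_genes)                  # LFC of single perturbation A
lfc_b = rng.normal(0.0, 0.3, n_genes)                  # LFC of single perturbation B
train_profiles = rng.normal(5.0, 1.0, (50, n_genes))   # training-set perturbation profiles

no_change_pred = control                    # always predict the control condition
additive_pred = control + lfc_a + lfc_b     # sum of individual LFCs (double perturbations)
mean_pred = train_profiles.mean(axis=0)     # average over training perturbations (unseen singles)
```

Any candidate model should beat all three before its predictions are taken seriously.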

Solution: If outperformed by the baselines, consider:

  • Leverage Perturbation Data: Pretrain your perturbation embeddings on existing large-scale perturbation datasets from related contexts (e.g., a different cell line) [2].
  • Reframe the Problem: For drug synergy, consider sequential model optimization (SMO) frameworks like RECOVER, which actively select informative experiments, enriching for synergistic combinations without requiring exhaustive screening [16].

Issue: Inaccurate Prediction of Genetic Interactions in Double Perturbations

Problem: Model cannot correctly identify or classify genetic interactions (e.g., synergistic, buffering).

Investigation & Diagnosis

  • Calculate Ground Truth Interactions: From your data, identify true genetic interactions by finding double perturbations where the phenotype differs from the additive expectation more than expected under a normal distribution null model [2].
  • Generate Interaction Predictions: Compute the difference between your model's predicted expression and the additive expectation for each double perturbation.
  • Create ROC Curves: For all possible prediction thresholds, compute the true-positive rate and false discovery proportion to visualize performance [2].
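The threshold sweep in the last step can be sketched as follows. The function name is illustrative; scores are deviation magnitudes between predictions and the additive expectation, and labels mark ground-truth interactions:

```python
import numpy as np

def tpr_fdp_curve(scores, is_interaction):
    """Sweep all thresholds over interaction scores, returning the
    true-positive rate and false discovery proportion at each cutoff."""
    order = np.argsort(-np.asarray(scores))   # most deviant predictions first
    labels = np.asarray(is_interaction)[order]
    tp = np.cumsum(labels)                    # true positives among the top-k calls
    n_called = np.arange(1, len(labels) + 1)
    tpr = tp / max(labels.sum(), 1)           # fraction of true interactions recovered
    fdp = (n_called - tp) / n_called          # fraction of calls that are false
    return tpr, fdp

tpr, fdp = tpr_fdp_curve([3.0, 2.0, 1.0, 0.0], [1, 0, 1, 0])
```

Plotting tpr against fdp for your model and for the "no change" baseline makes the comparison in the Solution step direct.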

Solution

  • If the model's performance curve is similar to or worse than the "no change" baseline, this aligns with published findings, and current architectures may be insufficient [2].
  • Focus on improving the model's capacity to represent non-linear biological relationships beyond simple additive effects.

Table 1: Benchmarking Results for Double Perturbation Prediction (based on Norman et al. data) [2]

Model / Baseline Prediction Error (L2 Distance) Performance in Genetic Interaction Prediction
Additive Baseline Reference Does not predict interactions by definition
No Change Baseline Higher than Additive Not better than random
scGPT Higher than Additive Not better than random
GEARS Higher than Additive Not better than random
Geneformer Higher than Additive Not better than random
scBERT Higher than Additive Not better than random

Table 2: Performance of Models on Unseen Single Perturbations [2]

Model / Approach Performance on Adamson (K562) & Replogle (K562, RPE1) Data
Mean Prediction Baseline Competitive, often not outperformed
Linear Model (P from training data) Competitive
scGPT (with its own decoder) Did not consistently outperform baseline
GEARS (with its own decoder) Did not consistently outperform baseline
Linear Model (with scGPT's G, training P) Outperformed scGPT's native decoder
Linear Model (P pretrained on Replogle data) Consistently outperformed all other models

Experimental Protocols

Protocol 1: Benchmarking Against Simple Baselines for Double Perturbations

Objective: Quantify if a complex model provides value over simple baselines [2].

Materials: Dataset with single and double perturbation phenotypes (e.g., log-transformed expression values).

Method:

  • Data Preparation: Split double perturbations into training and test sets.
  • Baseline Predictions:
    • No Change: For any test double perturbation, predict the control condition expression values.
    • Additive: For a double perturbation of genes A and B, predict: LFC(A+B) = LFC(A) + LFC(B), where LFC is the logarithmic fold change versus control.
  • Model Training & Prediction: Fine-tune your model on the training set and generate predictions for the test set.
  • Evaluation: Calculate the L2 distance between predicted and observed expression values for the top 1,000 highly expressed genes. Compare your model's error to that of the baselines.
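The additive baseline and L2 evaluation above look roughly like this. All expression values are synthetic stand-ins, and the top-gene selection here uses the control profile as a proxy for "highly expressed":

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 2000
control = rng.lognormal(2.0, 1.0, n_genes)                 # raw control expression
pert_a = control * rng.lognormal(0.0, 0.1, n_genes)        # single perturbations A and B
pert_b = control * rng.lognormal(0.0, 0.1, n_genes)
observed_ab = control * rng.lognormal(0.0, 0.2, n_genes)   # stand-in for measured A+B

def lfc(x):
    """LFC = log(expression + 1) - log(control expression + 1)."""
    return np.log1p(x) - np.log1p(control)

additive_pred = np.log1p(control) + lfc(pert_a) + lfc(pert_b)
no_change_pred = np.log1p(control)

top = np.argsort(-np.log1p(control))[:1000]                # top 1,000 expressed genes
err_additive = np.linalg.norm(additive_pred[top] - np.log1p(observed_ab)[top])
err_no_change = np.linalg.norm(no_change_pred[top] - np.log1p(observed_ab)[top])
```

Running the same `np.linalg.norm` comparison on a model's predictions puts it on the same footing as both baselines.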

Protocol 2: Linear Model with Embeddings for Unseen Perturbation Prediction

Objective: Test the predictive utility of pretrained embeddings using a simple, interpretable model [2].

Materials:

  • Data matrix Y_train (genes x perturbations) for training.
  • Pretrained gene embedding matrix G (optional).
  • Pretrained perturbation embedding matrix P (optional).

Method:

  • Embedding Generation: If G or P are not provided, create them via dimension reduction (e.g., PCA) on the training data.
  • Model Fitting: Solve the linear equation: Y_train ≈ (G * W * P^T) + b.
    • G: Gene embedding matrix (number of genes x K dimensions).
    • P: Perturbation embedding matrix (number of perturbations x L dimensions).
    • W: The learned weight matrix (K x L).
    • b: The vector of row means from Y_train.
  • Prediction: For a new perturbation, use its embedding p_new (from P or a lookup) to predict gene expression: y_pred = (G * W * p_new^T) + b.
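One way to solve this fit is with pseudo-inverses, giving the least-squares W in closed form. All matrices below are random stand-ins; a real G or P would come from pretraining or from PCA on the training data, as described above:

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_perts, K, L = 200, 30, 8, 8

Y = rng.normal(size=(n_genes, n_perts))   # training matrix (genes x perturbations)
G = rng.normal(size=(n_genes, K))         # gene embedding matrix
P = rng.normal(size=(n_perts, L))         # perturbation embedding matrix

b = Y.mean(axis=1, keepdims=True)         # row means of Y_train
# Least-squares W for Y - b ≈ G @ W @ P.T, via pseudo-inverses.
W = np.linalg.pinv(G) @ (Y - b) @ np.linalg.pinv(P.T)

def predict(p_new):
    """Predicted expression for a new perturbation embedding p_new (length L)."""
    return (G @ W @ np.asarray(p_new).reshape(-1, 1) + b).ravel()

y_pred = predict(P[0])                    # shape: (n_genes,)
```

Because the only learned object is the K x L matrix W, this model isolates the quality of the embeddings G and P from any architectural complexity.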

Experimental Workflow and Pathway Diagrams

[Diagram: Research question → testable hypothesis → experimental model (e.g., cell line, scRNA-seq data) → define variables (IV: perturbation; DV: expression change) → establish baselines (no change, additive, mean) → run model and baselines → analyze performance (L2 distance, interaction ROC) → evaluate against the zero-shot limitation thesis]

Experimental Design Flow

[Diagram: Input data (perturbations and expression) → feature representation (drug structures, cell line context) → foundation model (e.g., scGPT, Geneformer) → zero-shot prediction → predicted expression change → performance evaluation against baselines]

Zero Shot Prediction Pipeline

[Diagram: Core thesis (limitations of zero-shot scFM prediction) supported by three claims — simple baselines outperform complex scFMs (evidence: higher L2 distance for the models [2]); poor generalization on unseen perturbations (evidence: a linear model with perturbation-pretrained embeddings wins [2]); inaccurate prediction of genetic interactions (evidence: ROC performance no better than baseline [2])]

Evidence for ScFM Limitations

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools and Datasets

Item Name Function / Description Relevance to Troubleshooting
Simple Linear Baseline Models Provides a critical performance benchmark for any complex model. Confirms if a complex model adds value; essential for diagnosing poor performance [2].
Perturbation Datasets (e.g., Norman, Replogle) Standardized, publicly available datasets for training and benchmarking. Allows for reproducible benchmarking and comparison of model performance against published results [2].
Gene & Perturbation Embeddings (G, P) Low-dimensional representations of genes and perturbations. Their predictive utility can be tested in a linear model framework to isolate embedding quality from architecture complexity [2].
Sequential Model Optimization (SMO) Framework An active learning approach that selects the most informative experiments to run next. Efficiently explores large drug combination spaces, enriching for synergistic hits with minimal experimentation [16].
Large Language Model (LLM) Embeddings Context-enriched embeddings for drugs and cell lines generated from models like GPT-3.5. Can be used as input features to represent drugs and cell lines in a unified pipeline for tasks like drug synergy prediction [17].

Troubleshooting Guides and FAQs

Common Problem: Poor Model Performance on Novel Perturbations

Q: My foundation model for predicting chemical perturbation effects performs poorly on novel compounds or cell lines it wasn't trained on. What could be wrong?

A: This is a known limitation in current single-cell foundation models (scFMs). Recent benchmarking studies show that even advanced models like scGPT and Geneformer often fail to outperform simple baselines in zero-shot settings—where models are used without any further training on new data [2] [1]. Performance issues are particularly pronounced when predicting effects for unseen single or double perturbations [2].

  • Diagnosis Steps:

    • Benchmark against simple baselines: Compare your model's performance against a deliberately simple baseline, such as an additive model that predicts the sum of individual logarithmic fold changes (LFCs) for perturbations [2].
    • Check for dataset overlap: Verify whether your evaluation dataset was part of the model's original pretraining data. Models may not perform well even on previously seen data, but performance can be more variable on truly novel data [1].
    • Evaluate embedding utility: Test if the pretrained gene or cell embeddings from the foundation model (e.g., from scGPT or scFoundation) provide any benefit when used with a simple linear predictor. Research indicates these embeddings may offer little to no advantage over random embeddings or those derived from the training data itself [2].
  • Solution Steps:

    • Implement a linear baseline: Use a simple linear model as a performance benchmark. This model can be formulated to use embeddings for genes (G) and perturbations (P), solving for a matrix W that minimizes prediction error [2].
    • Incorporate perturbation data in pretraining: If possible, pretrain your model or its embeddings on relevant perturbation data, as this has been shown to increase predictive performance more than pretraining on single-cell atlas data alone [2].
    • Consider alternative architectures for chemical perturbations: For predicting responses to novel chemical perturbations, consider specialized models like PRnet. This is a perturbation-conditioned deep generative model designed to generalize to unseen compounds by using their SMILES string representations, and it has shown better performance in this specific domain [18].

Common Problem: Inaccurate Prediction of Genetic Interactions

Q: My model is unable to accurately predict non-additive genetic interactions (like synergy or buffering) from double perturbation data. How can I improve this?

A: Many deep learning models struggle to correctly identify true genetic interactions, often exhibiting a strong bias towards predicting "buffering" interactions and rarely correctly predicting synergistic effects [2].

  • Diagnosis Steps:

    • Analyze prediction patterns: Classify your model's predictions for double perturbations into interaction types (e.g., buffering, synergistic, opposite). Compare the frequency of predicted synergistic interactions against the ground truth; a very low rate of correct synergistic predictions indicates a common model limitation [2].
    • Check for prediction invariance: Investigate whether your model's predictions vary meaningfully across different perturbations. Some models may predict little to no change from the control condition, regardless of the perturbation applied [2].
  • Solution Steps:

    • Use a robust null model: Establish a rigorous null model (e.g., assuming a Normal distribution for deviations from additivity) to identify true genetic interactions in your data at a defined false discovery rate (e.g., 5%) [2].
    • Quantify model performance on interactions: For each model, compute true-positive rate (TPR) and false discovery proportion curves across all possible prediction thresholds to see if any model outperforms a simple "no change" baseline. Current research suggests that many do not [2].
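A minimal sketch of the Normal-null interaction call with Benjamini-Hochberg control of the false discovery rate. The function name is illustrative, and only NumPy and the standard library are used:

```python
import numpy as np
from statistics import NormalDist

def call_interactions(deviation, fdr=0.05):
    """Flag double perturbations whose deviation from the additive
    expectation is significant under a Normal null, at the given FDR."""
    deviation = np.asarray(deviation, dtype=float)
    z = (deviation - deviation.mean()) / deviation.std(ddof=1)
    p = np.array([2.0 * (1.0 - NormalDist().cdf(abs(v))) for v in z])  # two-sided p-values
    order = np.argsort(p)
    m = len(p)
    passed = p[order] <= (np.arange(1, m + 1) / m) * fdr               # Benjamini-Hochberg
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    significant = np.zeros(m, dtype=bool)
    significant[order[:k]] = True
    return significant
```

Note that a real analysis would estimate the null's mean and variance from replicate measurements rather than from the deviations themselves; this sketch shows only the thresholding logic.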

Common Problem: Failure in Zero-Shot Cell Type Clustering and Batch Integration

Q: The cell embeddings produced by my foundation model in a zero-shot setting fail to separate cell types effectively or remove batch effects. Why?

A: Zero-shot evaluation of foundation models like scGPT and Geneformer reveals that their cell embeddings often underperform compared to established methods for tasks like cell type clustering and batch correction. The primary structure in the embeddings may be driven by batch effects rather than biological signal [1].

  • Diagnosis Steps:

    • Visualize embeddings: Use dimensionality reduction (e.g., UMAP, t-SNE) to visually inspect the embeddings. If the primary clustering is by batch or data source rather than known cell type, the embeddings are not performing well [1].
    • Compare to established baselines: Calculate clustering and batch integration metrics (e.g., Average BIO score, ASW, batch mixing scores) for your foundation model embeddings and compare them against simpler methods like Highly Variable Genes (HVG), Harmony, or scVI [1].
  • Solution Steps:

    • Rely on proven methods for clustering and integration: For zero-shot tasks, methods like HVG selection, scVI, or Harmony currently provide more reliable performance for cell type separation and batch integration [1].
    • Re-evaluate the pretraining objective: The masked language model pretraining used by many scFMs may not inherently produce high-quality cell embeddings for all downstream tasks. Consider if your task aligns with the model's pretraining objective [1].

Quantitative Performance Data

Table 1: Benchmarking Model Performance on Double Genetic Perturbation Prediction (based on Norman et al. data in [2])

Model / Baseline Prediction Error (L2 distance) vs. Additive Baseline Strength in Predicting Genetic Interactions
Additive Baseline Reference (Best) None (by definition)
No Change Baseline Higher Poor (cannot predict synergy)
GEARS Higher Poor (rarely predicts correct synergy)
scGPT Higher Poor
Geneformer* Higher Poor
scBERT* Higher Poor

Note: Models marked with * were not originally designed for the task and were repurposed with a linear decoder. L2 distance was calculated for the top 1,000 most highly expressed genes. None of the deep learning models outperformed the simple additive baseline [2].

Table 2: Zero-Shot Performance on Cell Type Clustering (AvgBIO Score) (representative data from [1])

Model / Method Pancreas Dataset Immune Dataset Tabula Sapiens PBMC (12k)
HVG (Highly Variable Genes) Best Best Best High
scVI High Medium High Medium
Harmony High Medium Medium Medium
scGPT Low Low Low Best
Geneformer Low Low Low Low

Table 3: Key Reagent Solutions for Perturbation Screening

Reagent / Material Function in Experiment
CRISPR Activation/Interference System Used to perform targeted genetic perturbations (gene knockout or activation) in cell lines (e.g., K562, RPE1) to generate ground-truth data for model training and validation [2].
SMILES Strings & RDKit Simplified Molecular-Input Line-Entry System strings provide a standardized text representation of a compound's chemical structure. RDKit is a cheminformatics library used to process SMILES and generate molecular fingerprints (e.g., FCFP) for model input, enabling generalization to novel compounds [18].
LanthaScreen Eu Kinase Binding Assay A TR-FRET based binding assay. Useful for studying kinase interactions, including with inactive forms of the kinase, which may not be possible with activity assays [19].
Functional-Class Fingerprints (FCFP) A type of molecular fingerprint generated from a compound's SMILES string. It captures functional topology information, which can be rescaled by dosage to create an embedding vector representing the chemical perturbation for models like PRnet [18].

Experimental Protocols

Protocol 1: Benchmarking Perturbation Prediction Against Linear Baselines

This protocol is adapted from the benchmarking study detailed in [2].

  • Data Preparation:

    • Obtain a perturbation dataset with transcriptome-wide expression values (e.g., log-transformed RNA-seq) for both control and perturbed conditions. Example datasets include Norman et al. (double perturbations) or Replogle et al. (single gene perturbations) [2].
    • For double perturbation benchmarks, split the double perturbations into training and held-out test sets (e.g., 62 for training, 62 for testing). Include all single perturbations in the training data.
    • Define the set of read-out genes for evaluation (e.g., the 1,000 most highly expressed genes).
  • Establish Baselines:

    • No Change Model: For any perturbation, predict the expression values of the control condition.
    • Additive Model: For a double perturbation A+B, predict the sum of the individual LFCs of A and B added to the control expression. LFC is calculated as log(expression_perturbed + 1) - log(expression_control + 1).
  • Model Training and Fine-tuning:

    • Fine-tune the foundation models (e.g., scGPT, GEARS, Geneformer) on the training set of perturbations according to their respective protocols.
    • For a fair comparison, ensure all models are evaluated on the same gene set and using the same data splits.
  • Evaluation:

    • Calculate the L2 distance (Euclidean distance) between the predicted and observed expression values for the read-out genes in the test set.
    • Compare the prediction error of the deep learning models against the simple baselines.

Protocol 2: Zero-Shot Evaluation of Cell Embeddings

This protocol is based on the zero-shot evaluation framework presented in [1].

  • Embedding Generation:

    • Select a labeled dataset with known cell types and, if applicable, multiple batches. Suitable benchmark datasets include the Pancreas dataset or Tabula Sapiens [1].
    • Process the dataset using the foundation model (e.g., Geneformer, scGPT) without any fine-tuning to extract cell embeddings.
  • Baseline Methods:

    • Apply established methods for comparison:
      • HVG: Select highly variable genes and use them directly for analysis.
      • scVI: Train the scVI model on the dataset to obtain cell embeddings.
      • Harmony: Apply the Harmony integration algorithm to the principal components of the data.
  • Task Evaluation:

    • Cell Type Clustering: Use the embeddings to cluster cells and compare the resulting clusters to the known cell type labels. Calculate metrics like Average BIO score and Average Silhouette Width (ASW).
    • Batch Integration: If the dataset contains batch information, evaluate how well the embeddings mix cells from different batches while preserving biological variation (cell type). Use metrics like batch mixing scores and Principal Component Regression (PCR) to quantify the variance explained by batch.
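The per-cell silhouette width behind the ASW metric can be computed directly from an embedding matrix. This pure-NumPy sketch (function name illustrative, assumes at least two labeled clusters) avoids any library dependency:

```python
import numpy as np

def silhouette_widths(X, labels):
    """Per-cell silhouette width for embeddings X (cells x dims) and cell-type labels."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    s = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                                # exclude the cell itself
        a = D[i, same].mean() if same.any() else 0.0   # mean intra-cluster distance
        b = min(D[i, labels == c].mean()               # mean distance to nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s
```

ASW is typically the mean of these per-cell widths, sometimes rescaled to [0, 1]; computing it with cell-type labels rewards biological separation, while computing it with batch labels penalizes residual batch structure.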

Method Workflow and Pathway Diagrams

[Diagram: Perturbation prediction model workflow. Genetic perturbation data (e.g., CRISPRa/i) routes to foundation models (scGPT, Geneformer) or the simple additive baseline; chemical perturbation data (SMILES, dosage) routes to specialized models such as PRnet, whose perturb-adapter encodes SMILES to rFCFP. Both paths converge on evaluation and benchmarking, which surfaces the key limitations: poor zero-shot performance, failure on novel perturbations, and underperformance versus linear models]

[Diagram: Troubleshooting logic for model failures. Poor model performance branches into three symptom → diagnosis → solution paths: fails on novel compounds/perturbations → model lacks zero-shot generalizability → benchmark vs. linear baselines and use PRnet for chemicals; poor cell type clustering / batch integration → embeddings capture batch effect, not biology → use HVG, scVI, or Harmony; incorrect genetic interaction predictions → model bias toward buffering interactions → implement a robust null model]

Diagnosing and Overcoming scFM Shortcomings: A Troubleshooting Guide

Frequently Asked Questions (FAQs)

FAQ 1: What are the most common failure modes for zero-shot perturbation prediction in single-cell foundation models (scFMs)?

Research indicates several key failure modes when using scFMs for zero-shot perturbation prediction:

  • Inability to Predict Genetic Interactions: Models struggle to predict non-additive effects, such as synergy or buffering, in double-gene perturbations. They often default to predicting no change or additive effects, failing to capture the true biological complexity [2].
  • Poor Generalization to Unseen Data: The embeddings and predictions from scFMs do not consistently outperform simple baselines, especially under distribution shift or when predicting the effects of perturbations not seen during the model's pretraining [6] [2].
  • Inadequate Batch Integration: In zero-shot settings, scFM-generated cell embeddings often fail to correct for technical batch effects. The primary structure in the embedding space can be driven more by batch than by meaningful biological variation, which hinders downstream analysis [1].

FAQ 2: My model's perturbation predictions seem biologically implausible. What could be wrong?

This is a recognized limitation in current scFMs. Benchmarking studies have found that a significant challenge for these models is predicting "strong or atypical perturbation effects" [6]. Furthermore, models may exhibit biases, such as consistently predicting buffering interactions while rarely and inaccurately predicting synergistic ones [2]. This suggests the model may not have learned the underlying gene regulatory networks robustly enough for out-of-distribution predictions.

FAQ 3: Can I trust the gene embeddings from a foundation model for my perturbation analysis?

Caution is advised. External benchmarks that extracted gene embeddings from scFMs like scGPT and scFoundation found that using these embeddings in a simple linear model did not consistently outperform using embeddings derived directly from the perturbation training data [2]. This indicates that the pretraining on single-cell atlases may not yet provide a decisive advantage over task-specific training for perturbation prediction.

FAQ 4: Are there any strategies to improve the accuracy of my perturbation predictions?

Evidence suggests that moving from an "open-loop" to a "closed-loop" framework can significantly enhance performance. This involves fine-tuning the foundation model by incorporating a limited amount of experimental perturbation data (e.g., from Perturb-seq). Studies have shown that even a small number of perturbation examples (around 20) integrated during fine-tuning can dramatically improve metrics like positive predictive value, sensitivity, and specificity [10].

Troubleshooting Guides

Problem: Model fails to predict genetic interactions in double-gene knockout experiments.

Observation: Model output for a double perturbation is essentially the sum of the two single perturbations (additive effect), or shows no change from the control. The model fails to identify known synergistic or buffering interactions.
Root Cause: The model has not effectively learned the underlying, non-linear relationships between genes that lead to emergent effects in combinations. The pretraining objective may not adequately prepare the model for this specific task in a zero-shot setting [2].
Solution:
  • Employ a simple baseline: Always compare your model's performance against a simple additive baseline model, which predicts the double perturbation effect as the sum of the two individual logarithmic fold changes.
  • Utilize closed-loop fine-tuning: If possible, move away from a pure zero-shot setting. Fine-tune the foundation model on any available double perturbation data, even from other cell types or conditions, to help it learn the concept of genetic interactions [10].
Prevention: When selecting a model, consult recent independent benchmarks that explicitly test for genetic interaction prediction rather than relying solely on claims from model publications [2].

Problem: Predictions are inaccurate for perturbations on genes or in cell types not well-represented in the model's pretraining data.

Observation: The model's predictions for unseen perturbations are no better than simply predicting the average expression across the training set. Performance degrades significantly under distribution shift [6] [2].
Root Cause: The foundation model's knowledge is constrained by the scope and diversity of its pretraining corpus. It lacks the ability to generalize reliably to completely novel biological contexts without additional guidance.
Solution:
  • Use a perturbation-informed baseline: Implement a linear model that leverages embeddings pretrained on large-scale perturbation datasets (if available), which may generalize better than atlas-based pretraining [2].
  • Leverage similarity metrics: For diseases with no known treatments, use models that explicitly transfer knowledge from similar, well-annotated diseases via metric learning, rather than relying on a pure zero-shot approach [20].
Prevention: Critically evaluate the pretraining data composition of any scFM before applying it to your specific research problem to identify potential data gaps.

Problem: Cell embeddings from the foundation model do not integrate well across batches in a zero-shot setting.

Observation: When the embeddings are visualized (e.g., via UMAP), cells cluster strongly by batch or experiment of origin rather than by cell type or biological state.
Root Cause: The model's pretraining objective (e.g., masked language modeling) does not automatically learn to produce batch-invariant representations. In zero-shot use, it has not been explicitly trained to remove technical noise [1].
Solution:
  • Use established batch correction tools: Pass the scFM embeddings into dedicated batch integration algorithms like Harmony or scVI as a post-processing step.
  • Try a simpler approach: Benchmark the scFM's performance against a simple baseline of Highly Variable Genes (HVG), which has been shown to outperform foundation models in some batch integration tasks [1].
Prevention: Do not assume that a foundation model's embeddings are batch-corrected by default. Always include batch integration as a formal step in your analysis workflow.
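To make the post-processing idea concrete, the sketch below removes per-batch mean shifts from precomputed embeddings with plain NumPy. This is a deliberately crude stand-in for Harmony or scVI, shown only to illustrate that correction is applied on top of, not inside, the zero-shot model; the function name and toy data are hypothetical:

```python
import numpy as np

def center_by_batch(embeddings, batches):
    """Subtract each batch's mean embedding (a crude linear batch correction)."""
    corrected = embeddings.copy()
    for b in np.unique(batches):
        mask = batches == b
        corrected[mask] -= corrected[mask].mean(axis=0)
    return corrected

# Two toy batches of cells whose embeddings differ by a constant technical shift.
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(0, 1, (40, 8)), rng.normal(5, 1, (40, 8))])
batches = np.array([0] * 40 + [1] * 40)
corrected = center_by_batch(emb, batches)
```

In a real workflow, `emb` would be the scFM's cell-by-embedding matrix and the correction step would be Harmony or scVI rather than simple centering.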

Quantitative Performance Data

Table 1: Benchmarking scFMs against Baselines in Double Perturbation Prediction (L2 Distance, lower is better) [2]

| Model / Baseline | Average L2 Distance (Top 1,000 Genes) |
| --- | --- |
| Additive Model (Baseline) | ~1.5 |
| No Change Model (Baseline) | ~2.5 |
| scGPT | ~2.5 |
| Geneformer* | ~3.0 |
| GEARS | ~2.3 |
| scFoundation | ~2.1 |

Note: Models marked with * were repurposed with a linear decoder.

Table 2: Zero-Shot Batch Integration Performance (Average BIO Score, higher is better) [1]

| Method | Pancreas Dataset | PBMC Dataset | Tabula Sapiens Dataset |
| --- | --- | --- | --- |
| HVG (Baseline) | ~0.7 | ~0.75 | ~0.8 |
| scVI | ~0.68 | ~0.72 | ~0.78 |
| Harmony | ~0.65 | ~0.7 | ~0.75 |
| scGPT | ~0.55 | ~0.72 | ~0.65 |
| Geneformer | ~0.45 | ~0.5 | ~0.55 |

Table 3: Impact of Closed-Loop Fine-Tuning on Prediction Accuracy [10]

| Metric | Open-Loop ISP | Closed-Loop ISP (with ~20 examples) |
| --- | --- | --- |
| Positive Predictive Value (PPV) | 3% | 9% |
| Sensitivity | 48% | 76% |
| Specificity | 60% | 81% |
| Negative Predictive Value (NPV) | 98% | 99% |

Experimental Protocols

Protocol 1: Benchmarking Perturbation Effect Prediction

This protocol is based on the benchmark described in [2].

  • Data Preparation: Obtain a single-cell perturbation dataset with both single and double gene perturbations (e.g., from Norman et al.). Split the double perturbations into a training and a held-out test set.
  • Baseline Setup:
    • Additive Baseline: For each double perturbation in the test set, calculate the predicted expression as the sum of the log fold changes (LFCs) from the two corresponding single perturbations.
    • No-Change Baseline: Predict that the expression in the double perturbation equals the control condition.
  • Model Fine-Tuning: Fine-tune the foundation models (e.g., scGPT, Geneformer) and other deep learning models (e.g., GEARS) on the training set of single and double perturbations according to their specified procedures.
  • Prediction & Evaluation: Run the fine-tuned models on the held-out test set of double perturbations. Calculate the L2 distance between the predicted and observed expression values for the top 1,000 highly expressed genes. Compare model performance against the baselines.
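The baselines and the evaluation step above can be sketched as follows. Synthetic LFC vectors stand in for the Norman et al. profiles, and the toy data is constructed to be additive, which is an assumption made here purely for illustration:

```python
import numpy as np

# Synthetic single-perturbation log fold change (LFC) profiles.
rng = np.random.default_rng(2)
n_genes = 2000
lfc_a = rng.normal(size=n_genes)
lfc_b = rng.normal(size=n_genes)
# Toy "observed" double perturbation: additive effects plus measurement noise.
observed_double = lfc_a + lfc_b + rng.normal(scale=0.1, size=n_genes)

additive_pred = lfc_a + lfc_b          # additive baseline
no_change_pred = np.zeros(n_genes)     # "no change" baseline (LFC of zero)

# Evaluate on the 1,000 genes with the strongest observed response; a real
# benchmark would instead select the most highly expressed genes.
top = np.argsort(-np.abs(observed_double))[:1000]
l2_additive = float(np.linalg.norm(additive_pred[top] - observed_double[top]))
l2_no_change = float(np.linalg.norm(no_change_pred[top] - observed_double[top]))
```

Any fine-tuned model's error should be reported alongside both numbers; on largely additive data, the additive baseline is hard to beat.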

Protocol 2: Evaluating Zero-Shot Cell Embeddings for Batch Integration

This protocol is based on the evaluation performed in [1].

  • Embedding Generation: Process a publicly available benchmark dataset with known batch effects and cell types (e.g., the Pancreas dataset) using the scFM in zero-shot mode to generate cell embeddings. Generate embeddings using baseline methods (HVG, scVI, Harmony) for the same dataset.
  • Dimensionality Reduction and Visualization: Apply UMAP to the embeddings from all methods to produce 2D visualizations.
  • Qualitative Assessment: Inspect the visualizations to determine if cells cluster primarily by biological cell type (good) or by technical batch (bad).
  • Quantitative Assessment: Calculate quantitative batch integration metrics such as the Average BIO score (which balances batch mixing and cell type separation) and the Average Silhouette Width (ASW) for each set of embeddings.
  • Comparison: Rank the performance of the scFM against the established baseline methods.
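The ASW metric in the quantitative step can be computed directly. The small NumPy implementation below handles a two-cluster toy case and is a stand-in for `sklearn.metrics.silhouette_score`, which you would normally use in practice:

```python
import numpy as np

def average_silhouette_width(X, labels):
    """Mean silhouette score: (b - a) / max(a, b), averaged over all cells."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        a = D[i, same].sum() / (same.sum() - 1)  # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated toy "cell types": a good embedding should score near 1.
rng = np.random.default_rng(3)
emb = np.vstack([rng.normal(0, 0.3, (30, 5)), rng.normal(3, 0.3, (30, 5))])
labels = np.array([0] * 30 + [1] * 30)
asw = average_silhouette_width(emb, labels)
```

Running the same function on scFM and baseline embeddings of the same dataset gives directly comparable numbers for the ranking step.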

Signaling Pathways and Workflows

[Workflow diagram: a pretrained foundation model (e.g., scGPT, Geneformer) applied in a zero-shot setting exhibits three failure modes (poor genetic interaction prediction, inaccurate unseen-perturbation prediction, failed batch integration); closed-loop fine-tuning on experimental data yields an improved model with higher PPV/sensitivity and better generalization.]

Zero-Shot Failure and Solution Workflow

[Signaling diagram: Perturbation A activates Pathway X and Perturbation B inhibits Pathway Y, with both pathways converging on Phenotype Z; the additive prediction (A+B) approximates the combined effect, whereas the model's failed prediction defaults to "no change".]

Genetic Interaction Prediction Failure

The Scientist's Toolkit

Table 4: Essential Research Reagents and Computational Tools

| Item | Function in Evaluation |
| --- | --- |
| CRISPRa/i Perturbation Datasets (e.g., Norman et al.) | Provides ground truth data for benchmarking model predictions of single and double gene perturbation effects on transcriptomes [2]. |
| Benchmark Datasets with Batch Effects (e.g., Pancreas Dataset) | Allows for the evaluation of a model's zero-shot ability to integrate data from multiple sources and correct for technical variation [1]. |
| Perturb-seq Data | Single-cell RNA-sequencing data from genetic perturbation screens. Used for closed-loop fine-tuning of foundation models to dramatically improve prediction accuracy [10]. |
| High-Performance Computing (HPC) Cluster | Essential for running and fine-tuning large foundation models, which are computationally intensive and often require GPU acceleration [2] [10]. |
| Linear Regression Model | A deliberately simple baseline model. Critical for determining whether a complex foundation model provides any meaningful performance improvement for a given task [2]. |
| Batch Correction Tools (e.g., Harmony, scVI) | Established algorithms used to correct for technical batch effects. Can be applied as a post-processing step to scFM embeddings or used as a performance baseline [1]. |

Single-cell foundation models (scFMs) like scGPT and Geneformer represent a significant advance in computational biology, promising to leverage large-scale pretraining to understand cellular states and predict experimental outcomes. A particularly ambitious goal for these models is zero-shot perturbation prediction—forecasting a cell's transcriptional response to genetic or chemical perturbation without any task-specific training data.

However, recent rigorous benchmarking studies reveal a sobering reality: in zero-shot settings, these complex models often fail to outperform simpler, traditional methods. This technical support article frames these limitations within the critical context of data quality and curation, providing researchers with troubleshooting guidance to navigate these challenges in their experimental workflows.

Troubleshooting Common scFM Experimental Issues

FAQ: Zero-Shot Performance and Model Selection

Q: My zero-shot scFM embeddings are underperforming for cell type annotation. What might be wrong? A: This is a recognized systematic limitation. Benchmarking studies indicate that scFM embeddings in zero-shot settings frequently underperform established dimensionality reduction techniques. Before assuming an implementation error, compare your results against a baseline method.

  • Recommended Action: Reproduce the benchmark finding by comparing your scFM embeddings (e.g., from scGPT or Geneformer) against a Highly Variable Genes (HVG) selection followed by PCA, or embeddings from scVI or Harmony. One study found that HVG selection alone outperformed both Geneformer and scGPT in cell type clustering across multiple datasets as measured by Average BIO score [1].

Q: Why do my scFM's perturbation effect predictions seem inaccurate? A: Predicting transcriptional responses to perturbation is a fundamentally challenging task. Recent evidence suggests that the foundational pretraining of many scFMs may not be optimally transferring to this specific objective.

  • Recommended Action: Validate your model's predictions against a deliberately simple baseline. For predicting double perturbation effects, compare your model's performance against a simple additive model (the sum of the individual log fold changes) or even a "no change" model (predicting control-condition expression). A benchmark of five foundation models found that none could consistently outperform these simple baselines [2].

Q: Can I trust a high benchmark score reported in an scFM's original publication? A: Exercise caution. Some original model publications may have used benchmark settings or comparisons that were particularly favorable. Independent, post-publication benchmarking is crucial for a realistic performance assessment.

  • Recommended Action: Consult recent, independent benchmark studies like PertEval-scFM or those published in peer-reviewed journals. These often reveal that perceived performance can be inflated due to factors like data duplication or suboptimal baseline comparisons [6] [2]. Always check if a simpler linear model, perhaps using pretrained embeddings, can achieve similar or better results [2].

Data Curation Troubleshooting Guide

Q: My model shows inflated performance during training but fails on external validation. What is the cause? A: This is a classic symptom of inadequate data curation. A common culprit is the presence of duplicate or non-independent data points in your training set, which leads to overfitting and poor generalizability.

  • Root Cause: The model is memorizing specific data points rather than learning generalizable biological principles.
  • Solution: Implement a rigorous data deduplication protocol. In one case study, models built with uncurated data showed a 7-24% higher correct classification rate, but this performance was illusory and attributed to duplicates in the training set [21].
  • Prevention: Before training, audit your dataset for chemical duplicates, batch effects, and technical replicates that may create data leakage.
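A minimal deduplication audit can be run with pandas before any training; the column names and toy records below are hypothetical:

```python
import pandas as pd

# Toy training table with one exact structural duplicate (hypothetical columns).
df = pd.DataFrame({
    "compound_id": ["c1", "c2", "c2b", "c3"],
    "smiles": ["CCO", "CCN", "CCN", "CCC"],   # c2 and c2b share a structure
    "label": [1, 0, 0, 1],
})

# Count and drop records that duplicate an earlier structure.
n_dupes = int(df.duplicated(subset=["smiles"]).sum())
deduped = df.drop_duplicates(subset=["smiles"], keep="first")
```

Duplicates that differ only in identifiers are exactly the kind that leak across random train/test splits and inflate apparent accuracy.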

Q: How does data quality from experimental sources impact my model's reliability? A: Profoundly. The reproducibility of the underlying experimental data used for training and benchmarking is a primary constraint on your model's achievable accuracy.

  • Evidence: Analyses of experimental toxicology data, often used in bioinformatics models, show that reproducibility for the same chemical can be as low as 60-75% between studies due to differences in protocol, dosing, and animal models [21].
  • Implication: Prediction error cannot be smaller than the underlying experimental measurement error. If your training data contains conflicting labels for the same biological entity, the model cannot learn a consistent pattern.
  • Action: When aggregating data from public sources, invest time in harmonizing study protocols and experimental conditions. Prioritize data from standardized, validated assays.

Experimental Protocols for Rigorous scFM Benchmarking

Protocol: Benchmarking Zero-Shot Cell Embedding Quality

Objective: To quantitatively evaluate the quality of scFM-generated cell embeddings for downstream tasks like cell type clustering and batch integration, comparing them against established baseline methods.

Materials:

  • A labeled single-cell dataset with known cell types and batch information (e.g., a human pancreas dataset from five different sources [1]).
  • Access to an scFM (e.g., scGPT or Geneformer) for generating zero-shot embeddings.
  • Standard software packages for single-cell analysis (e.g., Scanpy, Scikit-learn).

Methodology:

  • Data Preprocessing: Apply standard quality control and normalization to your chosen dataset.
  • Embedding Generation:
    • Generate cell embeddings using the scFM in zero-shot mode (no fine-tuning).
    • Generate comparable low-dimensional representations using baseline methods:
      • HVG+PCA: Select highly variable genes and perform PCA.
      • scVI: Train a scVI model (do not use cell type labels for supervision, so the comparison with zero-shot embeddings remains fair).
      • Harmony: Run Harmony on the PCA embedding to integrate batches.
  • Quantitative Evaluation: Calculate clustering and integration metrics for all embeddings.
    • For cell type separation, compute the Average BIO score and Average Silhouette Width (ASW). Higher values indicate better separation of known cell types [1].
    • For batch integration, compute the PCR (Principal Component Regression) score and batch mixing metrics. Lower PCR scores indicate less variance explained by batch effects [1].
  • Visual Inspection: Create UMAP plots colored by cell type and batch for each embedding method to qualitatively assess performance.
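The HVG+PCA baseline in step 2 needs nothing beyond NumPy; the sketch below uses a toy count matrix and standard SVD-based PCA, with gene and component counts chosen purely for illustration:

```python
import numpy as np

# Toy normalized expression matrix: 300 cells x 500 genes.
rng = np.random.default_rng(4)
X = np.log1p(rng.poisson(1.0, size=(300, 500)).astype(float))

n_hvg, n_pcs = 100, 20
hvg_idx = np.argsort(-X.var(axis=0))[:n_hvg]      # keep highest-variance genes
Xh = X[:, hvg_idx] - X[:, hvg_idx].mean(axis=0)   # center before PCA
U, S, Vt = np.linalg.svd(Xh, full_matrices=False)
pca_embedding = U[:, :n_pcs] * S[:n_pcs]          # 300 x 20 baseline embedding
```

In practice, scanpy's `highly_variable_genes` and `pca` functions do the same job with proper dispersion-based HVG selection.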

Expected Outcome: Simpler methods like HVG selection, scVI, and Harmony will often outperform or match scFMs in zero-shot settings. The following table summarizes typical benchmark results:

Table 1: Sample Benchmark Results for Cell Embedding Quality (Based on [1])

| Evaluation Metric | scGPT | Geneformer | HVG+PCA | scVI | Harmony |
| --- | --- | --- | --- | --- | --- |
| AvgBIO Score (Cell Type) | Variable, often lower | Underperforms | Consistently high | High | High |
| Batch Mixing Score | Moderate on seen data | Consistently low | High | High on technical batches | High on technical batches |
| Key Weakness | Inconsistent across datasets | Poor preservation of cell type info | N/A | Struggles with complex biological batches | Lower PCR score on complex datasets |

Protocol: Evaluating Perturbation Effect Prediction

Objective: To assess an scFM's ability to predict gene expression changes after single or double genetic perturbations, comparing its accuracy against simple additive and "no change" baselines.

Materials:

  • A perturbation dataset with ground-truth expression values for single and double perturbations (e.g., the Norman et al. CRISPRa dataset [2]).
  • The scFM to be evaluated (e.g., scGPT, scFoundation, Geneformer with a linear decoder).
  • Code for the simple baseline models.

Methodology:

  • Data Partitioning: For double perturbation prediction, fine-tune the model on all single perturbations and a held-in portion of double perturbations. Hold out a set of double perturbations for testing.
  • Baseline Setup:
    • Additive Baseline: For each double perturbation A+B, predict the sum of the LFCs from the single perturbations A and B.
    • "No Change" Baseline: Predict that the expression in the double perturbation condition is identical to the control condition.
  • Model Prediction: Generate expression predictions for the held-out double perturbations using the scFM.
  • Error Calculation: Calculate the L2 distance between predicted and observed expression values for the top 1,000 most highly expressed genes. Repeat across multiple random train/test splits for robustness.
  • Interaction Analysis: Identify genetic interactions in the ground-truth data (where the double perturbation effect non-additively deviates from the single effects). Assess the model's ability to recall these interactions versus generating false positives.
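A simple classifier for the interaction analysis step might look like this; the tolerance and the synergy/buffering rule are illustrative choices, not the criteria used in [2]:

```python
import numpy as np

def classify_interaction(single_a, single_b, double, tol=0.5):
    """Label a double perturbation relative to the additive expectation."""
    expected = single_a + single_b
    if np.linalg.norm(double - expected) < tol:
        return "additive"
    # Crude direction check: a stronger-than-expected response is called
    # "synergistic", a weaker one "buffering".
    if np.linalg.norm(double) > np.linalg.norm(expected):
        return "synergistic"
    return "buffering"

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
```

Recall on the synergistic class is where current models are weakest, so it is worth reporting per-class accuracy rather than a single aggregate number.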

Expected Outcome: The simple additive baseline is often very difficult to beat. Most scFMs will exhibit a higher prediction error (L2 distance) than this baseline [2].

Table 2: Performance Overview in Perturbation Prediction Benchmarks (Based on [2])

| Model Type | Performance vs. Additive Baseline | Performance on Unseen Perturbations | Identification of Genetic Interactions |
| --- | --- | --- | --- |
| scFMs (scGPT, scFoundation) | Higher prediction error | Not consistently better than linear models | Struggles; mostly predicts buffering types |
| Other DL Models (GEARS, CPA) | Higher prediction error | Not designed for unseen perturbations | Varies, but often suboptimal |
| Simple Additive Model | Baseline | Not applicable | Cannot predict by definition |
| Simple "No Change" Model | Higher prediction error | N/A | Cannot predict synergistic interactions |
| Linear Model with Pre-trained Embeddings | N/A | Can outperform full scFMs | N/A |

The Scientist's Toolkit: Essential Research Reagents & Materials

Table 3: Key Resources for scFM Research and Benchmarking

| Resource Name | Type | Function/Benefit | Example/Note |
| --- | --- | --- | --- |
| PertEval-scFM | Benchmarking Framework | Standardized framework for evaluating perturbation effect prediction in zero-shot settings [6]. | Helps avoid inflated performance claims. |
| Norman et al. Dataset | Benchmark Data | Provides ground-truth expression for 100 single and 124 double gene perturbations in K562 cells [2]. | Essential for reproducibility in perturbation tasks. |
| Replogle et al. Datasets | Benchmark Data | CRISPRi datasets in K562 and RPE1 cells for evaluating prediction on unseen perturbations [2]. | Tests model generalizability across cell lines. |
| Linear Baseline Model | Computational Method | A simple, interpretable model using gene and perturbation embeddings; serves as a critical sanity check [2]. | Often matches or beats complex scFMs. |
| TxGNN | Foundation Model (Drug Repurposing) | A graph-based model for zero-shot drug repurposing; demonstrates successful zero-shot application in a related domain [20]. | Provides a design pattern for effective zero-shot learning. |
| Rigorous Data Curation Pipeline | Methodology | A protocol for deduplication, unit harmonization, and error checking of input data [21] [22]. | The most critical factor for building reliable models. |

Workflow Visualization: From Data Curation to Reliable Prediction

The following diagram illustrates the critical path for developing and evaluating reliable single-cell foundation models, highlighting how rigorous data curation and benchmarking are foundational to success.

[Workflow diagram: raw data collection feeds data curation and quality control, followed by pretraining or fine-tuning, zero-shot embedding generation, and benchmarking (AvgBIO, L2 distance) against simple baselines (HVG, additive model); acceptable performance leads to deployment, while unacceptable performance loops back to curation.]

Diagram 1: scFM Development Workflow

The limitations observed in zero-shot perturbation prediction by current single-cell foundation models are not merely algorithmic but are fundamentally tied to the quality, reproducibility, and curation of the data on which they are built and evaluated. Researchers can navigate this landscape more effectively by:

  • Mandating Rigorous Data Curation as a non-negotiable first step to avoid inflated and non-reproducible results.
  • Systematically Employing Simple Baselines in every evaluation to ground truth model performance expectations.
  • Leveraging Independent Benchmarks to make informed decisions about model selection and application.

By adopting these practices, the field can build a more reliable foundation, steering the development of scFMs from models that underperform simple methods to robust tools that genuinely advance biological discovery.

Troubleshooting Guides & FAQs

This resource addresses common challenges researchers face when designing and evaluating models for biological reasoning, particularly in the context of zero-shot prediction of genetic perturbation effects.

Frequently Asked Questions

FAQ 1: Why do our sophisticated foundation models underperform simple baselines in zero-shot perturbation prediction?

Answer: Current single-cell foundation models (scFMs) often underperform simpler methods like Highly Variable Genes (HVG) selection or linear models in zero-shot settings due to several potential architectural and training limitations [1] [2] [23].

  • Problematic Pretraining Objective: The masked language modeling objective used by many scFMs may not inherently learn representations that are optimal for predicting downstream perturbation effects. The pretraining task and the zero-shot task can be misaligned [1].
  • Lack of Perturbation Data in Pretraining: Many models are pretrained primarily on observational single-cell data. Without explicit examples of perturbation effects during pretraining, the models may not learn the causal principles needed for robust prediction [2].
  • Over-reliance on Fine-Tuning: Model development often prioritizes performance after task-specific fine-tuning, which can mask fundamental weaknesses in the model's core, zero-shot representations [1].

Troubleshooting Steps:

  • Benchmark Against Simple Baselines: Always compare your model's zero-shot performance against deliberately simple baselines, such as an additive model of single-gene effects or a "no change" predictor [2].
  • Analyze Embedding Quality: Evaluate the cell and gene embeddings produced by your model for tasks like cell type clustering and batch integration. If they underperform established methods like scVI or Harmony, the core representations may be flawed [1].
  • Inspect Pretraining Data: Check for overlap between your evaluation datasets and the model's pretraining corpus. Performance on "seen" data does not guarantee generalizability [1].
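Checking pretraining overlap reduces to a set intersection over sample or cell identifiers; the accession IDs below are hypothetical, but real corpora typically publish the series they were trained on:

```python
# Hypothetical accession lists standing in for a model's published pretraining
# corpus and your evaluation dataset.
pretraining_ids = {"GSM001", "GSM002", "GSM003"}
evaluation_ids = {"GSM003", "GSM999"}

# Any overlap means the evaluation data is "seen", so results are not zero-shot.
leaked = pretraining_ids & evaluation_ids
```

If `leaked` is non-empty, either swap in a truly held-out dataset or report the overlap explicitly alongside your results.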

FAQ 2: How can we improve a model's generalization to unseen cell types or strong perturbations?

Answer: Generalization fails when models learn dataset-specific artifacts rather than underlying biological principles. This is evident in the significant performance drop observed under distribution shift [23].

  • Architectural Tweak: Incorporate Efficient Fine-Tuning. Instead of full model fine-tuning, use parameter-efficient methods like adapters. A drug-conditional adapter can enable prediction of responses to novel drugs and zero-shot generalization to unseen cell lines while preserving the original biological knowledge from pretraining [24].
  • Data-Centric Strategy: Diversify Pretraining Data. Ensure pretraining encompasses a wide variety of cell types, states, and experimental conditions. Crucially, incorporate perturbation data where possible to teach the model about causal relationships [2] [25].
  • Evaluation Protocol: Rigorously evaluate using a hold-out set that includes novel perturbations or cell types completely absent from training data to truly assess generalization [23].

FAQ 3: Our model fails to predict genetic interactions. What could be the cause?

Answer: Predicting non-additive genetic interactions (e.g., synergistic or buffering effects) is a complex challenge. Many models default to predicting buffering interactions or struggle to deviate from additive expectations [2].

  • Root Cause: The model may lack the architectural capacity to represent complex, non-linear interactions between genes. Alternatively, the training objective may not sufficiently penalize this failure mode.
  • Solution Approach: Move beyond simple regression losses. Consider architectural changes that explicitly model gene-gene interactions, perhaps by incorporating structured knowledge like gene regulatory networks (GRNs) or by using attention mechanisms to capture interdependencies between genes [25].

Quantitative Performance Benchmarking

The tables below summarize key findings from recent benchmarks, highlighting the performance gap between proposed scFMs and simpler models.

Table 1: Zero-Shot Cell Embedding Performance on Clustering (AvgBIO Score) [1]

| Model / Method | Pancreas Dataset | PBMC (12k) Dataset | Tabula Sapiens Dataset |
| --- | --- | --- | --- |
| HVG (Baseline) | 0.671 | 0.631 | 0.617 |
| scVI | 0.645 | 0.619 | 0.605 |
| Harmony | 0.632 | 0.600 | 0.589 |
| scGPT | 0.580 | 0.640 | 0.590 |
| Geneformer | 0.512 | 0.528 | 0.523 |

Table 2: Perturbation Effect Prediction Performance (L2 Distance, lower is better) [2]

| Model / Method | Double Perturbation Prediction | Unseen Perturbation Prediction |
| --- | --- | --- |
| Additive Model (Baseline) | ~1.4 | Not Applicable |
| No Change Model (Baseline) | ~1.7 | ~1.5 |
| scGPT | ~1.7 | ~1.6 |
| Geneformer* | ~1.8 | ~1.7 |
| GEARS | ~1.6 | ~1.6 |

*Note: Models marked with * were repurposed for this task with a linear decoder.

Experimental Protocols for Key Evaluations

Protocol 1: Zero-Shot Cell Embedding Evaluation

This protocol evaluates the quality of cell embeddings for downstream tasks without any fine-tuning [1].

  • Input: A target single-cell dataset (e.g., from a held-out tissue or study).
  • Embedding Generation: Pass the raw gene expression data through the foundation model to obtain a cell-by-embedding matrix. Do not update model weights.
  • Dimensionality Reduction: Apply UMAP or t-SNE to the embedding matrix for visualization.
  • Downstream Task: Perform tasks like:
    • Cell Type Clustering: Use Leiden or Louvain clustering on the embeddings and compute metrics like Average BIO (AvgBIO) score or Average Silhouette Width (ASW) against known cell type labels.
    • Batch Integration: Assess how well the embeddings mix cells from different technical batches while preserving biological variation using metrics like PCR (Principal Component Regression) and batch mixing scores.
  • Comparison: Compare the results against embeddings from established methods (scVI, Harmony) and a simple HVG selection baseline.

Protocol 2: Zero-Shot Perturbation Effect Prediction

This protocol tests a model's ability to predict transcriptome-wide changes after a genetic perturbation without being explicitly trained on that perturbation [2] [23].

  • Data Preparation: Use a single-cell perturbation dataset (e.g., CRISPR-based gene knockout/activation). Split perturbations into a training set (e.g., all single genes and some doubles) and a held-out test set (e.g., unseen double perturbations).
  • Model Setup: For foundation models not designed for this task, attach a linear decoder that maps the model's cell embedding to a gene expression vector.
  • Fine-Tuning (if applicable): On the training set, fine-tune only the linear decoder (or a small adapter), keeping the foundation model's core weights frozen to maintain its zero-shot properties.
  • Prediction: For each held-out test perturbation, input the control cell state and the perturbation identity. Run the model to predict the post-perturbation gene expression profile.
  • Evaluation: Calculate the L2 distance or Pearson correlation between the predicted and observed expression profiles. Compare predictions against a simple "additive" baseline (sum of single-gene effects) or a "no change" baseline.
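The evaluation step reduces to two one-liners; the `evaluate` helper below is an illustrative wrapper, not part of any benchmark's API:

```python
import numpy as np

def evaluate(pred, obs):
    """Return (L2 distance, Pearson r) between predicted and observed profiles."""
    l2 = float(np.linalg.norm(pred - obs))
    r = float(np.corrcoef(pred, obs)[0, 1])
    return l2, r

# Toy "good" prediction: the observed profile plus a little noise.
rng = np.random.default_rng(5)
obs = rng.normal(size=1000)
pred = obs + rng.normal(scale=0.2, size=1000)
l2, r = evaluate(pred, obs)
```

Run the same function on the additive and no-change baseline predictions so that the model's numbers always appear next to a reference point.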

Model Architecture & Workflow Diagrams

[Architecture diagram: a perturbation query initializes a high-level state (abstract plan) and a low-level state (detailed computation); the low-level state updates iteratively on a fast timescale, periodically resetting and informing the high-level state, whose final state produces the predicted expression change.]

HRM Brain-Inspired Architecture

[Workflow diagram: observational scRNA-seq data is used to pretrain (e.g., via masked LM) a single-cell foundation model; the frozen model plus a linear decoder/adapter is efficiently fine-tuned on perturbation training data and evaluated by zero-shot prediction on unseen perturbations and cell lines.]

Zero-Shot Perturbation Prediction

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools & Datasets for scFM Evaluation

| Item Name | Function / Purpose | Example Use Case |
| --- | --- | --- |
| PertEval-scFM [23] | A standardized benchmarking framework for evaluating zero-shot perturbation prediction. | Systematically comparing embedding quality from different scFMs against simple baselines. |
| Norman et al. Dataset [2] | A canonical dataset with single and double gene perturbations in K562 cells. | Benchmarking model performance on predicting double perturbation effects and genetic interactions. |
| Replogle et al. Dataset [2] | A large-scale single-cell CRISPRi perturbation dataset across multiple cell lines (K562, RPE1). | Evaluating model generalization to unseen perturbations and cross-cell-line prediction. |
| Linear Decoder / Adapter [2] [24] | A simple, trainable layer attached to a frozen foundation model. | Enabling efficient task-specific fine-tuning while preserving pretrained knowledge for zero-shot evaluation. |
| Additive & No-Change Baselines [2] | Deliberately simple prediction models that serve as a critical sanity check. | Establishing a performance floor; any proposed complex model should outperform these. |

Frequently Asked Questions

Q: What is a single-cell Foundation Model (scFM), and how is it supposed to work? A: A single-cell Foundation Model (scFM) is a large-scale deep learning model, typically based on a transformer architecture, that is pretrained on vast datasets containing millions of single-cell transcriptomes [26]. The concept is inspired by large language models. In these models, individual cells are treated like "sentences," and genes or genomic features are treated as "words" or "tokens" [26]. Through self-supervised pretraining (like predicting masked genes), the model aims to learn a universal representation of cellular states that can be adapted to various downstream tasks—such as predicting how a cell's gene expression will change after a genetic perturbation—without the need for extensive new experimental data [26].

Q: My goal is to predict the effects of unseen single or double genetic perturbations. Which model should I use? A: Current evidence suggests that you may achieve better or comparable results with deliberately simple baseline models rather than complex deep-learning scFMs. A 2025 benchmark study found that none of the five evaluated foundation models (including scGPT and scFoundation) and two other deep learning models outperformed simple additive or linear baselines for predicting transcriptome changes after single or double perturbations [2]. For predicting the effect of a perturbation not seen during training, a simple linear model or even just predicting the mean expression from the training data can outperform or match sophisticated foundation models [2].

Q: I have heard scFMs perform well in zero-shot settings. Is this true for tasks like batch integration or cell type identification? A: Rigorous zero-shot evaluation (using the model without any fine-tuning) reveals significant limitations. A 2025 study found that for cell type clustering and batch integration, the zero-shot performance of scGPT and Geneformer was inconsistent and often worse than established, simpler methods [1]. For instance, a simple approach of selecting Highly Variable Genes (HVG) frequently outperformed these foundation models in batch integration tasks. In cell type clustering, methods like scVI and Harmony generally provided more robust embeddings than the zero-shot scFM embeddings [1].

Q: What are the most common failure modes when using scFMs for perturbation prediction? A: Benchmarks have identified several specific failure modes [2]:

  • Lack of Generalization: Models often fail to predict effects for perturbations outside their training data.
  • Underestimation of Effects: Model predictions often vary less across different perturbations than the ground-truth experimental data does.
  • Poor Interaction Prediction: Models struggle to predict genetic interactions (where the effect of a double perturbation is not the sum of the single effects) and rarely correctly identify synergistic interactions.

Q: If simple models are currently better, what is the value of a foundation model? A: The primary value proposed for scFMs is their potential as a unified framework for multiple biological tasks. While they may not yet excel at specific tasks like perturbation prediction, their broad pretraining is intended to build a general understanding of biology [26]. The goal is a single model that can be adapted to many problems. However, current research indicates that this promise has not yet been fully realized for perturbation modeling, and simpler, task-specific models are more reliable for this application [2] [1].


Troubleshooting Your Experiments

Problem: Poor performance in predicting double perturbation effects.

  • Potential Cause: The model may not be effectively capturing non-additive genetic interactions.
  • Solution:
    • Validate against a baseline: Always compare your model's performance against a simple additive baseline, which predicts the double perturbation effect as the sum of the two single perturbation effects [2].
    • Inspect specific interactions: Analyze how the model performs on different classes of genetic interactions (buffering, synergistic, opposite). Current models have been shown to be particularly poor at predicting synergistic interactions correctly [2].

Problem: Model fails to generalize to unseen perturbations.

  • Potential Cause: The model's pretrained knowledge does not transfer effectively to the specific perturbation context of your experiment.
  • Solution:
    • Use a linear baseline: Implement a simple linear model that uses embeddings for genes and perturbations. This baseline has been shown to be highly competitive and can help determine if the complexity of a foundation model is justified [2].
    • Check for data leakage: Ensure that the model was not inadvertently pretrained on the test data you are using for evaluation, as this can lead to overly optimistic results [1].

Problem: Inconsistent zero-shot performance on tasks like cell type annotation or batch correction.

  • Potential Cause: The pretraining objective of masked language modeling may not directly produce high-quality cell embeddings for all downstream tasks without fine-tuning.
  • Solution:
    • Compare with established methods: Use simple baselines like Highly Variable Genes (HVG), scVI, or Harmony as a benchmark for your task [1].
    • Evaluate on multiple datasets: Test the model's robustness across datasets with different technologies, tissues, and batch structures. Performance can be highly variable [1].

Experimental Data & Benchmarks

Table 1: Benchmarking scFMs against baselines for double perturbation prediction. Data adapted from a 2025 study comparing model performance on predicting transcriptome changes after double gene perturbations. Prediction error is measured as L2 distance; lower is better [2].

| Model / Baseline | Prediction Error (L2 Distance) | Outperforms Additive Baseline? |
|---|---|---|
| Additive Model (Simple Baseline) | ~1.5 (Reference) | N/A |
| scGPT | ~2.8 | No |
| GEARS | ~2.5 | No |
| scFoundation | ~2.4 | No |
| Geneformer* | ~3.2 | No |
| No Change Model | ~3.5 | No |

Note: Models marked with an asterisk were not originally designed for the task and were repurposed with a linear decoder [2].

Table 2: Performance of a linear model with various embeddings for unseen perturbation prediction. Data shows that a simple linear model can be highly effective. The "Perturbation Data" embedding refers to pretraining on other perturbation datasets (e.g., using K562 data to predict RPE1 effects) [2].

| Embedding Source for Linear Model | Performance vs. Mean Baseline | Performance vs. scGPT/GEARS |
|---|---|---|
| Perturbation Data (e.g., Replogle) | Consistently outperforms | Outperforms |
| scFoundation Gene Embedding | Outperforms | Comparable or better |
| scGPT Gene Embedding | Outperforms | Comparable or better |
| Training Data Only | Comparable | Comparable or better |

Table 3: Zero-shot performance on cell type clustering (Average BIO Score). Higher scores are better. Data shows that simple methods often outperform foundation models. Adapted from a 2025 zero-shot evaluation [1].

| Model / Method | PBMC (12k) Dataset | Tabula Sapiens Dataset | Pancreas Dataset |
|---|---|---|---|
| HVG (Highly Variable Genes) | ~0.75 | ~0.63 | ~0.58 |
| scVI | ~0.72 | ~0.61 | ~0.56 |
| Harmony | ~0.70 | ~0.55 | ~0.54 |
| scGPT (Zero-shot) | ~0.76 | ~0.57 | ~0.52 |
| Geneformer (Zero-shot) | ~0.65 | ~0.51 | ~0.48 |

Experimental Protocols

Protocol 1: Benchmarking Perturbation Effect Prediction Against an Additive Baseline

This protocol is crucial for critically evaluating any model's performance on predicting double perturbation effects [2].

  • Data Preparation: Obtain a dataset with measured transcriptomes for single gene perturbations (A and B) and their double perturbation (A+B). A commonly used dataset is from Norman et al., which includes 100 single and 124 double perturbations in K562 cells [2].
  • Define the Additive Baseline: For each gene, calculate the Logarithmic Fold Change (LFC) for single perturbations A and B versus a control. The additive prediction for the double perturbation is: LFC_A+B (predicted) = LFC_A + LFC_B.
  • Train and Evaluate Model: Fine-tune your model on a training set of single and some double perturbations. Evaluate its prediction error on a held-out test set of double perturbations.
  • Calculate Prediction Error: Compute the L2 distance between the predicted and observed gene expression values for the top 1,000 most highly expressed genes.
  • Compare Results: Directly compare the model's prediction error against the error of the additive baseline. A model that cannot outperform this simple baseline is not capturing meaningful interaction effects [2].
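The additive baseline and the L2 error from the protocol above fit in a few lines of numpy. This is a minimal sketch: the pseudocount `eps` and the function names are our own choices, not from the benchmark code [2].

```python
import numpy as np

def lfc(perturbed, control, eps=1e-6):
    """Log2 fold change of mean expression vs. control, per gene.
    Inputs are (cells, genes) count/expression matrices."""
    return np.log2((perturbed.mean(axis=0) + eps) / (control.mean(axis=0) + eps))

def additive_prediction(lfc_a, lfc_b):
    """Additive baseline: predicted double-perturbation LFC = sum of singles."""
    return lfc_a + lfc_b

def l2_error(predicted, observed, control_mean=None, top_k=1000):
    """L2 distance between predicted and observed profiles, optionally
    restricted to the top_k most highly expressed genes (by control mean)."""
    if control_mean is not None:
        idx = np.argsort(control_mean)[::-1][:top_k]
        predicted, observed = predicted[idx], observed[idx]
    return float(np.linalg.norm(predicted - observed))
```

A model whose `l2_error` on held-out double perturbations is not below that of `additive_prediction` is not capturing interaction effects.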

Protocol 2: Zero-Shot Evaluation for Cell Type Clustering

This protocol assesses the intrinsic quality of a model's cell embeddings without fine-tuning [1].

  • Embedding Generation: Pass a labeled dataset (e.g., from the Tabula Sapiens project) through the pretrained scFM without any fine-tuning to obtain a latent embedding for each cell.
  • Dimensionality Reduction: Apply a standard technique like UMAP to the cell embeddings to reduce them to two dimensions for visualization.
  • Clustering and Metric Calculation:
    • Use a clustering algorithm like Leiden or Louvain on the full-dimensional embeddings.
    • Calculate clustering metrics such as the Average BIO score and Average Silhouette Width (ASW) by comparing the clusters to the known cell type labels.
  • Benchmark Against Baselines: Compare the metrics against those obtained from embeddings generated by established methods like scVI or simple HVG selection.
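The ASW metric used in this protocol can be sketched directly in numpy. Real pipelines would typically use scikit-learn's `silhouette_score` or a dedicated benchmarking package; this hand-rolled version is only meant to make the metric concrete.

```python
import numpy as np

def average_silhouette_width(X, labels):
    """Average silhouette width (ASW) of embeddings X given cell-type labels.
    Range [-1, 1]; higher means better-separated cell types."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances (O(n^2); fine for a sketch)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    uniq = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():          # singleton cluster: silhouette 0 by convention
            scores.append(0.0)
            continue
        a = D[i, same].mean()       # mean intra-cluster distance
        b = min(D[i, labels == c].mean() for c in uniq if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Running this on both the scFM embeddings and the baseline embeddings (HVG, scVI, Harmony) gives a directly comparable number per method.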

Model Comparison Workflow

The following diagram illustrates a recommended workflow for selecting and evaluating a model for perturbation prediction, incorporating key questions and steps based on recent benchmark findings.

  • Start: define your goal.
  • Is your primary task perturbation effect prediction?
    • Yes → Consider simple baselines first. Are you predicting the effects of unseen perturbations?
      • Yes → Use a simple linear model or mean baseline.
      • No → Proceed with an scFM, but benchmark rigorously.
    • No → Do you require zero-shot performance?
      • Yes → Use HVG, scVI, or Harmony instead of scFMs.
      • No → Proceed with an scFM, but benchmark rigorously.

The Scientist's Toolkit

Table 4: Key Research Reagents and Datasets for scFM Benchmarking

| Item | Function in Evaluation | Source / Example |
|---|---|---|
| Norman et al. Dataset | Ground-truth data for single and double gene perturbations (CRISPRa) in K562 cells; used for benchmarking double perturbation prediction [2]. | Norman et al. 2019, as processed by subsequent studies [2]. |
| Replogle et al. Dataset | Large-scale CRISPRi dataset in K562 and RPE1 cells; used for benchmarking prediction of unseen single perturbations [2]. | Replogle et al. 2022 [2]. |
| Tabula Sapiens Dataset | Large, multi-tissue single-cell reference atlas; benchmark for zero-shot cell type clustering and batch integration [1]. | Tabula Sapiens Consortium [1]. |
| Simple Additive Model | Critical baseline that sums single-perturbation effects to predict a double perturbation; validates whether more complex models provide any advantage [2]. | Implemented from scratch as per benchmarking studies [2]. |
| Linear Model with Embeddings | Simple yet powerful model for predicting unseen perturbations; uses gene and perturbation embedding matrices learned from data [2]. | Kernfeld et al. / Csendes et al. (preprints) [2]. |

Rigorous Validation and Comparative Analysis of scFM Performance

Frequently Asked Questions

Q1: What is the core finding of recent benchmarks on single-cell Foundation Models (scFMs) for perturbation prediction? Recent rigorous benchmarks consistently show that large, pretrained scFMs often fail to outperform deliberately simple baseline models when predicting gene expression changes after genetic perturbations in a zero-shot setting. Notably, simple baselines like an additive model (summing individual logarithmic fold changes) or even just predicting the mean expression from the training data can match or exceed the performance of complex foundation models like scGPT, Geneformer, and scFoundation [2] [1] [6].

Q2: Why is zero-shot evaluation particularly important for scFMs? Zero-shot evaluation tests a model's ability to perform a task without any additional task-specific training. This is critical for:

  • Exploratory Biology: Many research scenarios lack predefined labels or sufficient data for fine-tuning.
  • Assessing True Understanding: It evaluates whether the model has learned general, transferable biological principles during pretraining, as opposed to just memorizing patterns [1].

Q3: What are some common simple baselines used in these benchmarks? Benchmarks often compare scFMs against the following baselines:

  • No change model: Always predicts the same expression as the control condition.
  • Additive model: For a double gene perturbation, predicts the sum of the individual logarithmic fold changes.
  • Mean prediction: Always predicts the overall average gene expression from the training data.
  • Linear models with basic embeddings [2] [1].
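The first three baselines are deliberately trivial to implement, which is exactly the point: they establish a performance floor at near-zero cost. A sketch, assuming expression profiles and LFCs are numpy arrays (names are illustrative):

```python
import numpy as np

def baseline_predictions(control, train_profiles, lfc_a=None, lfc_b=None):
    """Return the simple baselines as a dict of predicted profiles.

    control        : (n_genes,) control-condition expression
    train_profiles : (n_perturbations, n_genes) training expression profiles
    lfc_a, lfc_b   : optional single-perturbation LFCs for the additive baseline
    """
    preds = {
        # 'No change': always predict the control condition
        "no_change": np.asarray(control, dtype=float),
        # 'Mean': always predict the average training profile
        "mean": np.asarray(train_profiles, dtype=float).mean(axis=0),
    }
    if lfc_a is not None and lfc_b is not None:
        # 'Additive': sum the single-perturbation LFCs
        preds["additive"] = np.asarray(lfc_a) + np.asarray(lfc_b)
    return preds
```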

Q4: Do scFMs provide useful data representations (embeddings) for perturbation tasks? While scFMs like scGPT and scFoundation learn gene embeddings during pretraining, benchmarks found that using these embeddings in a simple linear model did not consistently outperform linear models using embeddings derived directly from the perturbation data itself. This suggests that pretraining on large single-cell atlases may offer only a small benefit for this specific task compared to training on relevant perturbation data [2].

Q5: What is PertEval-scFM? PertEval-scFM is a standardized benchmarking framework designed specifically to evaluate the ability of models, particularly single-cell Foundation Models, to predict transcriptional responses to genetic perturbations. It provides a rigorous and systematic way to test models in a zero-shot setting, highlighting their current limitations and guiding future development [6] [23] [5].

Troubleshooting Guides

Issue 1: Poor Performance on Perturbation Effect Prediction

Problem: Your scFM is not accurately predicting gene expression changes after a genetic perturbation and is underperforming compared to simple baselines.

Solutions:

  • Benchmark Against Simple Baselines: Before investing significant resources into a complex scFM, always implement simple baseline models like the additive model for double perturbations or a mean predictor. This establishes a performance floor and helps quantify the actual value added by the complex model [2].
  • Consider a Hybrid Linear Approach: Instead of using the scFM's full architecture for prediction, extract its pretrained gene embeddings. Use these embeddings as features in a simpler, interpretable linear model. This can sometimes yield more robust performance and is computationally cheaper [2].
  • Validate on Strong and Atypical Perturbations: Be aware that most models, including scFMs, struggle to predict strong or highly non-standard perturbation effects. Manually inspect performance on these cases, as they can reveal key weaknesses [23].
  • Use Perturbation Data for Pretraining: If possible, leverage existing large-scale perturbation datasets (e.g., from CRISPR screens) to pretrain or fine-tune your model. Evidence suggests that pretraining on perturbation data can be more beneficial than pretraining only on atlas-level single-cell data for this specific task [2].

Issue 2: Failure in Zero-Shot Generalization

Problem: The model performs well on data similar to its training set but fails to generalize to new cell types, tissues, or experimental conditions without fine-tuning.

Solutions:

  • Check for Data Distribution Shift: Evaluate the model's performance across datasets with varying technical and biological backgrounds. Models like scGPT may show inconsistent performance under distribution shift [1] [6].
  • Re-evaluate the Model's Purpose: If your task requires reliable zero-shot performance on truly novel data, current scFMs may not be the optimal tool. Consider whether established, simpler methods for your specific task (e.g., scVI for integration, HVGs for clustering) are more reliable [1].
  • Investigate Embedding Robustness: Perform a diagnostic check on the cell or gene embeddings produced by the scFM. Use visualization and metrics like batch integration scores to see if the embeddings are capturing relevant biology or are dominated by technical artifacts [1].
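One concrete way to run the embedding diagnostic above is to compute the fraction of total embedding variance explained by batch identity (a one-way ANOVA R², averaged over dimensions). This is our own simple proxy, not the PCR metric reported in the benchmark [1]:

```python
import numpy as np

def batch_variance_fraction(X, batch_labels):
    """Fraction of total embedding variance explained by batch identity.
    Values near 1 suggest embeddings are dominated by technical batch
    effects rather than biology."""
    X = np.asarray(X, dtype=float)
    batch_labels = np.asarray(batch_labels)
    grand_mean = X.mean(axis=0)
    ss_total = ((X - grand_mean) ** 2).sum()
    ss_between = 0.0
    for b in np.unique(batch_labels):
        Xb = X[batch_labels == b]
        ss_between += len(Xb) * ((Xb.mean(axis=0) - grand_mean) ** 2).sum()
    return float(ss_between / (ss_total + 1e-12))
```

Computing the same quantity with cell-type labels in place of batch labels gives a quick read on whether biology or batch dominates the embedding space.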

The table below summarizes the key experimental methodology from foundational benchmarking studies.

Table 1: Summary of Key Benchmarking Experiments

| Benchmark Study | Core Task | Models Evaluated | Simple Baselines | Key Evaluation Metric |
|---|---|---|---|---|
| Nature Methods (2025) [2] | Predict transcriptome changes after single/double perturbations | scGPT, scFoundation, Geneformer, GEARS, CPA, UCE, scBERT | 'No change', 'Additive', Linear model, 'Mean' prediction | L2 distance between predicted & observed expression |
| Genome Biology (2025) [1] | Zero-shot cell type clustering & batch integration | Geneformer, scGPT | Highly Variable Genes (HVG), Harmony, scVI | Average BIO score, Average Silhouette Width |
| PertEval-scFM (2024) [6] [23] [5] | Zero-shot perturbation effect prediction | Five leading scFMs | Simple baseline models | Prediction accuracy on strong/atypical perturbations and under distribution shift |

The Scientist's Toolkit

Table 2: Essential Research Reagents & Computational Tools

| Item Name | Function / Explanation | Relevance to Benchmarking |
|---|---|---|
| PertEval-scFM Framework | A standardized benchmarking framework for evaluating perturbation prediction. | Provides a rigorous protocol and metrics to fairly compare different models against baselines [6] [23]. |
| Additive Model | A simple baseline that sums the LFCs of single perturbations to predict a double perturbation's effect. | Serves as a critical baseline; outperforming it is a minimum requirement for any proposed complex model [2]. |
| Linear Model with Embeddings | A simple predictive model that uses gene/perturbation embeddings as input features. | Used to test whether an scFM's learned representations contain meaningful information for the task [2]. |
| CRISPR Perturbation Datasets | High-quality datasets from studies like Norman et al., Replogle et al., and Adamson et al. | Provide the essential ground-truth data for training and benchmarking perturbation prediction models [2]. |
| Highly Variable Genes (HVG) | A standard feature selection method to filter genes before analysis. | A strong and simple baseline for tasks like cell type clustering and batch correction, often outperforming zero-shot scFMs [1]. |

Experimental Workflow Diagram

The following diagram illustrates the logical workflow for benchmarking a single-cell Foundation Model against simple baselines, as described in the referenced studies.

  • Input: perturbation dataset (e.g., Norman et al.).
  • 1. Data partitioning: split single and double perturbations into training and test sets.
  • 2. Configure models: set up the scFMs and baselines (e.g., the 'Additive' model).
  • 3. Model training: fine-tune the scFMs; fit the baselines.
  • 4. Generate predictions on the held-out test set.
  • 5. Quantitative evaluation: calculate L2 distance and assess interaction detection.
  • 6. Compare performance: do the scFMs outperform the simple baselines?
  • Output: benchmark results.

Frequently Asked Questions (FAQs)

Q1: When should I consider using a zero-shot single-cell foundation model (scFM) over a simpler, traditional method? The decision should be based on a careful evaluation of your specific task and resources. While scFMs are versatile and can be applied to diverse tasks without retraining, recent benchmarks indicate that in many cases, especially for perturbation effect prediction and cell type clustering, simpler methods can match or even surpass their performance [1] [2] [6]. You should prioritize scFMs when you need a single model for multiple exploratory tasks and have the computational resources to run them. For a single, well-defined task with a labeled dataset, a simpler model like a linear baseline or scVI may be more efficient and effective [28].

Q2: My zero-shot scGPT embeddings performed poorly on cell type clustering. What could be the reason? This is a commonly reported issue. The core problem may lie in the pretraining objective itself. Models like scGPT and Geneformer are often trained using a masked language modeling task, where they learn to predict the expression of randomly masked genes. However, evaluations suggest that these models may not have developed a deep, generalizable understanding of gene relationships from this task [1] [29]. They can struggle to predict held-out gene expression accurately, often defaulting to predicting median expression values, which indicates a failure to learn context-specific biological patterns that are crucial for distinguishing cell types zero-shot [29].

Q3: How can I rigorously benchmark my scFM for perturbation prediction to ensure the results are meaningful? A robust benchmark must include deliberately simple baselines. For predicting transcriptome changes after genetic perturbation, a "no change" model (predicting control condition expression) and an "additive" model (summing the logarithmic fold changes of single perturbations for a double perturbation) are essential comparators [2]. Surprisingly, multiple studies have found that current scFMs and other deep learning models often fail to outperform these simple baselines [2] [6]. It is also critical to evaluate the model's ability to predict genetic interactions and its performance on unseen perturbations, using established datasets from studies like Norman et al. and Replogle et al. [2].

Q4: What are the key metrics for evaluating the biological relevance of a scFM's embeddings? Beyond standard clustering metrics, it is important to use metrics that directly assess biological plausibility. Novel metrics like scGraph-OntoRWR measure the consistency between the cell-type relationships captured by the model's embeddings and the known relationships in established cell ontologies [28]. Additionally, for cell type annotation tasks, the Lowest Common Ancestor Distance (LCAD) metric can be used; it measures the ontological proximity of misclassified cell types, ensuring that any errors are biologically reasonable (e.g., confusing two closely related T-cell subtypes) rather than nonsensical [28].

Troubleshooting Guides

Problem: Poor Zero-Shot Performance on Cell Type Clustering and Batch Integration

Issue: Your scFM embeddings fail to separate known cell types or correct for technical batch effects better than established methods like Harmony or scVI.

| Step | Action | Expected Outcome & Further Diagnosis |
|---|---|---|
| 1. Baseline Comparison | Compare your scFM results against a simple Highly Variable Genes (HVG) baseline and integration methods like Harmony or scVI [1]. | If HVG outperforms your scFM, it indicates a fundamental issue with the foundational embeddings [1]. |
| 2. Check Data Overlap | Investigate whether your evaluation dataset was part of the model's pretraining corpus (e.g., check the original model papers) [1]. | Strong performance on "seen" data but poor performance on "unseen" data suggests overfitting and a lack of generalizability [1]. |
| 3. Visual Inspection | Create UMAP plots colored by both cell type and batch [1]. | A good embedding shows clear clustering by cell type and mixing of batches. If the primary structure is driven by batch, the model has failed at integration [1]. |
Solution: If the above steps confirm the model's limitations, consider switching to a simpler, more robust method for your specific task. The computational cost of scFMs may not be justified for zero-shot clustering and integration based on current evidence [1] [28].

Problem: Inaccurate Prediction of Genetic Perturbation Effects

Issue: Your model cannot accurately predict gene expression changes following single or double genetic perturbations.

| Step | Action | Expected Outcome & Further Diagnosis |
|---|---|---|
| 1. Implement Simple Baselines | Benchmark your model against a "no change" model and an "additive" model for double perturbations [2]. | Failure to outperform these baselines is a major red flag that the model has not learned the underlying biological causality [2]. |
| 2. Analyze Failure Modes | Examine which types of perturbations are poorly predicted. | Models often struggle with predicting strong or atypical perturbation effects and are biased towards predicting "buffering" interactions over "synergistic" ones [2] [6]. |
| 3. Test Embedding Utility | Extract the model's gene embeddings and use them in a simple linear predictor [2]. | If the linear model with these embeddings performs as well as the full model, the model's complex decoder is not adding value [2]. |
Solution: Given that current scFMs struggle with this task, a pragmatic approach is to use a simple linear model, potentially augmented with pretrained gene embeddings from a foundation model or, more effectively, from prior perturbation data [2].

The following tables summarize key findings from recent, rigorous evaluations of single-cell foundation models.

Table 1: Zero-Shot Performance on Core Tasks vs. Baselines

This table synthesizes findings from multiple benchmark studies comparing scFMs to established methods [1] [28] [2].

| Task | Top-Performing Methods | Underperforming Methods | Key Metric | Performance Summary |
|---|---|---|---|---|
| Cell Type Clustering | HVG, scVI, Harmony | scGPT, Geneformer | AvgBIO / ASW | scFMs were consistently outperformed by simpler methods across multiple datasets; in some cases they performed worse than a randomly initialized model [1]. |
| Batch Integration | HVG, scVI, Harmony | Geneformer, scGPT | Batch Integration Score, PCR | HVG often achieved the best scores. Geneformer consistently ranked last, sometimes increasing batch effect variance [1]. |
| Perturbation Effect Prediction | Additive Model, No-Change Model, Linear Models | scGPT, Geneformer, GEARS, scFoundation | L2 Distance (Predicted vs. True Expr.) | Deep learning models, including scFMs, failed to consistently outperform simple baselines on predicting double perturbation outcomes or unseen perturbations [2] [6]. |

Table 2: scFM Performance on Perturbation Prediction Benchmark

Data adapted from a study benchmarking models on the Norman et al. double perturbation dataset [2].

| Model / Baseline | Prediction Error (L2 Distance, mean ± SE) | Outperforms Additive Baseline? | Notes |
|---|---|---|---|
| Additive Baseline | Reference Value | N/A | Sums single-gene LFCs; does not use double-perturbation data for training. |
| No-Change Baseline | ~1.5x Additive Error | No | Predicts control condition expression. |
| scGPT | ~1.4x Additive Error | No | Struggled to predict strong interactions; predictions varied less than ground truth [2]. |
| Geneformer* | ~1.6x Additive Error | No | Repurposed with a linear decoder; performance was suboptimal [2]. |
| scFoundation | ~1.3x Additive Error | No | Predictions showed limited variation across different perturbations [2]. |
| GEARS | ~1.3x Additive Error | No | Specifically designed for perturbation prediction but was outperformed by simple baselines [2]. |

*Note: Geneformer was not originally designed for this task and was adapted for the benchmark. [2]

Experimental Protocols for Benchmarking

Protocol 1: Benchmarking Zero-Shot Embeddings for Clustering

This protocol evaluates the intrinsic quality of scFM cell embeddings for discerning cell types without any fine-tuning [1] [28].

  • Input Data: Use a labeled, well-annotated scRNA-seq dataset with known cell types. Ideal datasets contain multiple batches or technologies to test integration (e.g., Pancreas benchmark dataset) [1].
  • Embedding Generation: Pass the raw or normalized count matrix through the scFM in "zero-shot" or "inference" mode to extract the cell embeddings. Do not perform any fine-tuning.
  • Baseline Generation:
    • Generate cell embeddings using Highly Variable Genes (HVG) only.
    • Generate embeddings using established integration methods like scVI and Harmony.
  • Dimensionality Reduction & Clustering: Apply standard pipelines (e.g., UMAP, Leiden clustering) to all embedding sets.
  • Evaluation:
    • Primary Metrics: Calculate Average BIO (AvgBIO) score and Average Silhouette Width (ASW) to quantify cell type separation [1].
    • Visualization: Create UMAP plots colored by cell type and batch to qualitatively assess biological preservation and technical integration [1].
    • Biological Relevance: Apply ontology-informed metrics like scGraph-OntoRWR to check if the embedding's relational structure aligns with prior biological knowledge [28].

Protocol 2: Evaluating Perturbation Effect Prediction

This protocol tests a model's ability to predict transcriptional changes after genetic perturbation, a key claimed ability of some scFMs [2] [6].

  • Data: Use a public perturbation dataset with both single and double perturbations, such as the Norman et al. (K562 cells) or Replogle et al. (K562/RPE1 cells) datasets [2].
  • Task Setup: For double perturbation prediction, fine-tune the models on all single perturbations and a held-in portion of double perturbations. Evaluate the prediction error on a held-out set of double perturbations.
  • Critical Baselines:
    • No-Change Model: Always predicts the control condition's expression.
    • Additive Model: For a double perturbation A+B, predicts the sum of the LFCs from single perturbations A and B.
  • Evaluation:
    • Primary Metric: L2 distance between predicted and observed expression for the top 1,000 highly expressed genes [2].
    • Interaction Prediction: Assess the model's ability to correctly identify non-additive genetic interactions (synergistic or buffering) by comparing predicted vs. observed deviations from the additive expectation [2].
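One possible way to operationalize the interaction classes in the step above is to compare the magnitude and direction of the observed double-perturbation LFC profile with the additive expectation. The threshold `tol` and the cosine test below are our own illustrative choices, not published cutoffs [2]:

```python
import numpy as np

def classify_interaction(lfc_a, lfc_b, lfc_ab, tol=0.1):
    """Classify a genetic interaction by comparing the observed combined
    effect (lfc_ab) with the additive expectation (lfc_a + lfc_b)."""
    additive = np.asarray(lfc_a) + np.asarray(lfc_b)
    observed = np.asarray(lfc_ab)
    exp_mag = np.linalg.norm(additive)
    obs_mag = np.linalg.norm(observed)
    # Direction: cosine similarity between observed and additive profiles
    cos = observed @ additive / (obs_mag * exp_mag + 1e-12)
    if cos < 0:
        return "opposite"      # combined effect points against the additive expectation
    if obs_mag > (1 + tol) * exp_mag:
        return "synergistic"   # stronger than the sum of the single effects
    if obs_mag < (1 - tol) * exp_mag:
        return "buffering"     # weaker than the sum of the single effects
    return "additive"
```

Tallying these classes for predicted vs. observed profiles exposes the reported bias of current models towards predicting buffering over synergistic interactions.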

Experimental Workflow and Signaling Pathways

  • Large-scale scRNA-seq data (millions of cells) feeds masked language model pretraining (predicting masked gene expression).
  • Pretraining yields a single-cell foundation model (e.g., scGPT, Geneformer).
  • The foundation model is evaluated zero-shot on downstream tasks: cell type clustering, perturbation prediction, and batch integration.
  • Baselines enter the same evaluation: HVG + PCA, the additive model, and scVI / Harmony.
  • The evaluation produces the benchmarking outcome via comparative metrics.

Zero-Shot scFM Evaluation Workflow

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools and Datasets for scFM Evaluation

| Item | Function / Description | Example Use in Benchmarking |
|---|---|---|
| Benchmarking Datasets | High-quality, publicly available scRNA-seq data with ground-truth labels. | Pancreas Dataset [1]: tests batch integration across 5 technologies. Norman et al. Perturbation Data [2]: single/double CRISPRa perturbations in K562 cells. Tabula Sapiens [1]: a multi-tissue, multi-donor reference atlas. |
| Linear Baselines | Simple models that serve as a critical sanity check. | Additive Model [2]: baseline for perturbation prediction. No-Change Model [2]: predicts control expression. HVG + PCA [1]: baseline for clustering and integration. |
| Established Methods | Robust, non-foundation-model algorithms for standard tasks. | scVI [1]: a generative model for data integration and analysis. Harmony [1]: an algorithm for integrating datasets across technologies. |
| Ontology-Informed Metrics | Metrics that incorporate prior biological knowledge. | scGraph-OntoRWR [28]: evaluates biological consistency of cell relationships. LCAD (Lowest Common Ancestor Distance) [28]: measures severity of cell type misclassification. |
| Benchmarking Frameworks | Standardized code for fair and reproducible evaluation. | PertEval-scFM [6]: a framework for benchmarking perturbation prediction. MLflow / Weights & Biases [30]: tools for tracking experiments, parameters, and metrics. |

Frequently Asked Questions (FAQs)

FAQ 1: What does it mean that simple models outperform foundation models for perturbation prediction? Recent benchmark studies have demonstrated that deliberately simple baseline models can match or even surpass sophisticated single-cell foundation models (scFMs) in predicting gene perturbation effects [2]. For example, a simple 'additive' model, which predicts double-knockout effects by summing the logarithmic fold changes of single knockouts, outperformed models like scGPT, Geneformer, and GEARS on held-out double perturbation data [2]. This highlights a significant challenge in the field: the goal of creating a generalizable model that provides a robust representation of cellular states for predicting experimental outcomes remains elusive [2].

FAQ 2: Why is zero-shot evaluation particularly important for scFMs? Zero-shot evaluation—assessing a model's performance without any task-specific fine-tuning—is critical for judging whether pretraining has endowed the model with a genuine, transferable understanding of biology [1]. This is especially important in single-cell biology, where many tasks are exploratory and lack predefined labels for fine-tuning [1]. Evaluations have revealed that in zero-shot settings, scFMs can be outperformed by simpler methods on tasks like cell type clustering and batch integration, exposing limitations that might be masked by a fine-tuning-based evaluation protocol [1].

FAQ 3: What are biology-informed evaluation metrics? Biology-informed metrics are designed to assess whether a model's outputs align with established biological knowledge. Moving beyond purely technical metrics, they evaluate biological plausibility. Examples include:

  • scGraph-OntoRWR: Measures the consistency of cell type relationships captured by the model with known relationships in cell ontology [4].
  • Lowest Common Ancestor Distance (LCAD): Assesses the severity of cell type misclassification by measuring the ontological proximity between the predicted and true cell type [4].
  • Genetic Interaction Prediction: Evaluates a model's ability to correctly identify non-additive effects in double perturbations, such as buffering or synergy [2].

FAQ 4: My scFM performs well on technical metrics but yields biologically implausible results. What should I do? This discrepancy indicates a potential failure of the model to capture meaningful biological relationships, despite optimizing for technical benchmarks. You should:

  • Incorporate Biology-Informed Metrics: Immediately adopt metrics like scGraph-OntoRWR to quantitatively evaluate biological consistency [4].
  • Perform a Failure Mode Analysis: Closely examine the specific predictions that are biologically implausible. A common finding is that models may predict "buffering" interactions far more often than "synergistic" ones, and these predictions are often incorrect [2].
  • Validate with Simple Baselines: Always compare your model's performance against simple baselines, such as a linear model or even just predicting the mean expression, to ensure the complexity of a foundation model is justified [2].

Troubleshooting Guides

Issue 1: Poor Performance in Predicting Genetic Interactions

Problem: Your model fails to accurately predict genetic interactions (e.g., in double-gene perturbations), often defaulting to predicting no interaction or showing high error rates.

Investigation & Resolution:

| Step | Action | Expected Outcome |
| --- | --- | --- |
| 1 | Benchmark against an additive baseline. Compare your model's predictions on double perturbations to a simple model that sums the effects of the two single perturbations [2]. | The complex model should significantly outperform the simple additive baseline. If it does not, the foundation model is not capturing the interaction effect. |
| 2 | Classify the types of errors. Analyze whether the model is missing specific classes of genetic interactions, such as synergistic or opposite effects [2]. | The distribution of predicted interaction types (buffering, synergistic, opposite) should roughly match the validated ground truth data. |
| 3 | Check prediction variance. Examine whether the model's predictions vary meaningfully across different perturbations or are consistently close to zero or the control condition [2]. | Predictions should show appropriate variance that reflects the biological changes in the data. |
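Step 1's additive-baseline comparison can be sketched as follows. The LFC vectors and the "model" prediction here are synthetic stand-ins, and `l2` mirrors the L2 metric used in [2]; nothing below comes from a real model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 1000

# Synthetic log-fold-changes (LFCs) for two single perturbations and their observed double.
lfc_a = rng.normal(scale=0.4, size=n_genes)
lfc_b = rng.normal(scale=0.4, size=n_genes)
interaction = rng.normal(scale=0.1, size=n_genes)  # small non-additive component
observed_double = lfc_a + lfc_b + interaction

# Additive baseline: just sum the single-perturbation effects.
additive_pred = lfc_a + lfc_b

def l2(pred, obs):
    """L2 distance between predicted and observed expression changes."""
    return float(np.linalg.norm(pred - obs))

# A hypothetical scFM prediction to compare against (here: additive plus extra noise).
model_pred = additive_pred + rng.normal(scale=0.2, size=n_genes)

print("additive L2:", round(l2(additive_pred, observed_double), 2))
print("model L2:   ", round(l2(model_pred, observed_double), 2))
```

If the additive baseline's L2 error is lower than the model's, as in this toy setup, the complex model is not capturing the interaction effect.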

Issue 2: Failure in Zero-Shot Cell Type Identification

Problem: When using your scFM zero-shot (without fine-tuning) to generate cell embeddings, the resulting clusters do not separate known cell types effectively.

Investigation & Resolution:

| Step | Action | Expected Outcome |
| --- | --- | --- |
| 1 | Compare to established baselines. Generate cell embeddings using highly variable genes (HVG), scVI, or Harmony and compare the clustering performance (e.g., using AvgBIO score or ASW) to your scFM's embeddings [1]. | The scFM's zero-shot embeddings should be competitive with or superior to embeddings from these established methods. |
| 2 | Quantify batch effect removal. Use batch integration metrics to check whether the primary structure in the embeddings is driven by biological signal (cell type) or technical batch effects [1]. | Biological signal should explain more variance than batch effects in the embedding space. |
| 3 | Use ontology-based metrics. Apply biology-informed metrics like LCAD to determine whether misclassifications are at least biologically "close" (e.g., confusing two T-cell subtypes) or "distant" (e.g., confusing a T cell with a neuron) [4]. | Misclassifications should have a low LCAD, meaning they occur between biologically similar cell types. |

Issue 3: Model Cannot Generalize to Unseen Perturbations

Problem: The model performs poorly when predicting the effects of perturbing a gene that was not included in its training data.

Investigation & Resolution:

| Step | Action | Expected Outcome |
| --- | --- | --- |
| 1 | Test a linear model with embeddings. Extract the gene and perturbation embeddings learned by the scFM during pre-training and use them in a simple linear model (e.g., Eq. 1 from [2]) to predict perturbation effects. | The performance of the linear model using the scFM's embeddings indicates the quality of the representations learned during pre-training. |
| 2 | Compare to a "mean" prediction baseline. A very strong baseline is to simply predict the average expression across the training set for any unseen perturbation [2]. | Your model must significantly outperform this naive baseline to be considered useful. |
| 3 | Leverage external perturbation data. If available, pre-training on a separate, large-scale perturbation dataset (even from a different cell line) can create better perturbation embeddings that improve generalizability [2]. | Pre-training on diverse perturbation data should lead to better performance on new perturbations than pre-training on atlas data alone. |
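Steps 1 and 2 can be combined into one quick experiment: fit a ridge regression from perturbation embeddings to expression changes and compare it to the mean-of-training baseline. All data below is synthetic; the embedding dimension, gene counts, and the linear generative model are illustrative assumptions, not the actual setup in [2].

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n_train, n_test, d_emb, n_genes = 80, 20, 16, 200

# Hypothetical perturbation embeddings (e.g., extracted from an scFM) and the
# expression changes they should predict; generated synthetically here.
W_true = rng.normal(size=(d_emb, n_genes))
emb = rng.normal(size=(n_train + n_test, d_emb))
lfc = emb @ W_true + rng.normal(scale=0.5, size=(n_train + n_test, n_genes))

# Eq.-1-style linear model: predict perturbation effects directly from embeddings.
model = Ridge(alpha=1.0).fit(emb[:n_train], lfc[:n_train])
pred = model.predict(emb[n_train:])

# Compare against the naive mean-of-training baseline for unseen perturbations.
mean_pred = np.tile(lfc[:n_train].mean(axis=0), (n_test, 1))
err_linear = np.mean((pred - lfc[n_train:]) ** 2)
err_mean = np.mean((mean_pred - lfc[n_train:]) ** 2)
print(f"linear-from-embeddings MSE {err_linear:.2f} vs mean baseline {err_mean:.2f}")
```

With real scFM embeddings, the gap (or lack of one) between these two numbers tells you how much predictive signal pre-training actually put into the representations.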

The following tables summarize key quantitative findings from recent benchmark studies, providing a reference for expected performance.

Table 1: Benchmarking Perturbation Prediction Performance. This table summarizes results from a study comparing several models and baselines on their ability to predict transcriptome changes after genetic perturbations [2].

| Model / Baseline | Performance on Double Perturbations | Performance on Unseen Single Perturbations |
| --- | --- | --- |
| Additive Model | Outperformed all deep learning models [2] | Not applicable |
| No Change Model | Competitive with deep learning models [2] | Not applicable |
| Deep Learning Models (e.g., GEARS, scGPT) | Underperformed the simple additive baseline [2] | Did not consistently outperform a simple linear model or mean-prediction baseline [2] |
| Linear Model with Pre-trained Embeddings | Not reported | Competitive with or better than the original deep learning models [2] |

Table 2: Zero-Shot Performance on Cell-Level Tasks. This table summarizes the performance of scFMs against established baselines in a zero-shot setting, where models are not fine-tuned on the target data [1]. Performance is ranked from best (1) to worst (4).

| Method | Cell Type Clustering (AvgBIO Score) | Batch Integration (Batch Mixing Score) |
| --- | --- | --- |
| Highly Variable Genes (HVG) | 1 [1] | 1 [1] |
| scVI | 2 [1] | 2 [1] |
| Harmony | 3 [1] | 3 [1] |
| scGPT / Geneformer | 4 [1] | 4 [1] |

Experimental Protocols

Protocol 1: Benchmarking Genetic Interaction Prediction

This protocol outlines the steps to evaluate a model's ability to predict non-additive effects in double-gene perturbations, as performed in [2].

Methodology:

  • Data Preparation: Use a dataset with measured transcriptome effects for single- and double-gene perturbations (e.g., Norman et al. CRISPR activation data in K562 cells) [2].
  • Train-Test Split: Fine-tune models on all single perturbations and a subset of double perturbations. Hold out a separate set of double perturbations for testing.
  • Baseline Calculation: Compute the "additive baseline" for each held-out double perturbation by summing the LFCs of its two constituent single perturbations.
  • Genetic Interaction Identification: In the ground truth data, identify true genetic interactions by finding double perturbations whose phenotype differs from the additive expectation more than expected by chance (e.g., using a false discovery rate of 5%) [2].
  • Model Prediction & Evaluation:
    • Calculate the L2 distance between predicted and observed expression for top-expressed genes.
    • For genetic interactions, compute true-positive and false-discovery rates by comparing model-predicted interactions (deviation from additive > threshold) to the ground truth interactions.
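The interaction-classification step above can be operationalized as a simple decision rule. The 0.2 tolerance below is an arbitrary illustration, not the FDR-based criterion from [2], and the scalar phenotype scores are invented for the example.

```python
import numpy as np

def classify_interaction(single_sum, observed, threshold=0.2):
    """One plausible operationalization of genetic-interaction classes: compare the
    observed double-perturbation effect to the additive expectation (sum of singles)."""
    deviation = observed - single_sum
    if abs(deviation) <= threshold:
        return "additive"      # within tolerance of the additive model
    if np.sign(observed) != np.sign(single_sum):
        return "opposite"      # effect flips direction
    if abs(observed) < abs(single_sum):
        return "buffering"     # weaker than the additive expectation
    return "synergistic"       # stronger than the additive expectation

# Scalar phenotype scores (e.g., an LFC summary) for illustration.
print(classify_interaction(single_sum=1.0, observed=1.1))   # additive
print(classify_interaction(single_sum=1.0, observed=0.4))   # buffering
print(classify_interaction(single_sum=1.0, observed=1.8))   # synergistic
print(classify_interaction(single_sum=1.0, observed=-0.6))  # opposite
```

Running the same rule over model predictions and ground truth yields the true-positive and false-discovery rates described in the protocol.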

Protocol 2: Zero-Shot Evaluation of Cell Embeddings

This protocol describes how to evaluate the biological relevance of cell embeddings generated by an scFM without any fine-tuning [4] [1].

Methodology:

  • Embedding Extraction: Input a labeled dataset (with cell types and batch information) into the pre-trained scFM and extract the cell embeddings from its output layer.
  • Dimensionality Reduction: Apply UMAP or t-SNE to the embeddings for visualization.
  • Quantitative Evaluation with Technical Metrics:
    • Cell Type Clustering: Use metrics like Average Silhouette Width (ASW) and Average BIO score to assess how well the embeddings separate known cell types.
    • Batch Integration: Use metrics like batch mixing score and Principal Component Regression (PCR) to assess how well the embeddings mix cells from different batches while preserving biological signal.
  • Quantitative Evaluation with Biology-Informed Metrics:
    • scGraph-OntoRWR: Build a k-nearest neighbor graph from the embeddings and use a random walk with restart algorithm on the cell ontology graph to measure the consistency between the embedding-derived relationships and known ontological relationships [4].
    • Lowest Common Ancestor Distance (LCAD): For misclassified cells, calculate the distance between the true and predicted cell type in the cell ontology hierarchy. Lower distances indicate less severe errors [4].
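The LCAD idea can be made concrete with a toy ontology graph. The six-node ontology below is a made-up miniature of the real Cell Ontology, and the distance definition (path length from the lowest common ancestor to each type) is a simple sketch of the metric, not the exact formulation in [4].

```python
import networkx as nx

# Toy cell ontology: edges point from parent (general) to child (specific).
onto = nx.DiGraph([
    ("cell", "immune cell"), ("cell", "neuron"),
    ("immune cell", "T cell"), ("immune cell", "B cell"),
    ("T cell", "CD4 T cell"), ("T cell", "CD8 T cell"),
])

def lcad(true_type, pred_type):
    """Lowest Common Ancestor Distance: ontological path length between the true and
    predicted cell type via their lowest common ancestor."""
    lca = nx.lowest_common_ancestor(onto, true_type, pred_type)
    return (nx.shortest_path_length(onto, lca, true_type)
            + nx.shortest_path_length(onto, lca, pred_type))

# A "close" error (two T-cell subtypes) vs a "distant" one (T cell vs neuron).
print(lcad("CD4 T cell", "CD8 T cell"))  # LCA = "T cell", small distance
print(lcad("CD8 T cell", "neuron"))      # LCA = "cell", large distance
```

Confusing two T-cell subtypes scores a distance of 2, while confusing a T cell with a neuron scores 4, matching the intuition that the first error is far less severe.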

Key Experiment Workflow

The following diagram illustrates a comprehensive biology-informed evaluation workflow for a single-cell foundation model, integrating both technical and biological validation steps.

[Workflow diagram] The scFM feeds two parallel evaluation tracks: a technical track (perturbation prediction vs. the additive baseline; zero-shot cell clustering against HVG, scVI, and Harmony; batch integration metrics) and a biology-informed track (genetic interaction classification, scGraph-OntoRWR ontology consistency, and LCAD). Both tracks converge on a holistic model performance assessment.

Biology-Informed Model Evaluation Workflow

Research Reagent Solutions

Table 3: Key Computational Tools and Datasets for Evaluation. This table lists essential resources for conducting a rigorous evaluation of single-cell foundation models.

| Item Name | Type | Function / Explanation |
| --- | --- | --- |
| Norman et al. Data | Dataset | A key dataset for benchmarking perturbation prediction, containing transcriptome profiles for 100 single-gene and 124 double-gene perturbations in K562 cells [2]. |
| Additive Baseline Model | Computational Baseline | A simple model that predicts the effect of a double perturbation by summing the logarithmic fold changes of the two single perturbations. Crucial for benchmarking complex models [2]. |
| scGraph-OntoRWR | Evaluation Metric | A novel metric that quantifies the consistency between cell-type relationships learned by the model and the known relationships in a cell ontology, providing a biology-informed performance measure [4]. |
| Cell Ontology | Knowledge Base | A structured, controlled vocabulary for cell types. Serves as the source of prior biological knowledge for metrics like scGraph-OntoRWR and LCAD [4]. |
| Replogle & Adamson Data | Dataset | Large-scale single-cell CRISPRi perturbation datasets (in K562 and RPE1 cells) used for benchmarking a model's ability to generalize to unseen perturbations [2]. |

What is the current performance landscape of scFMs for perturbation prediction?

Current benchmarks reveal that single-cell foundation models (scFMs) often fail to outperform simpler, traditional methods for predicting transcriptional responses to genetic perturbations [2] [6]. While scFMs are powerful tools for integrating diverse datasets and exploring biological systems, their performance varies significantly across tasks, with no single model consistently dominating others [4].

Table 1: scFM Performance Summary on Perturbation Tasks

| Model | Reported Performance on Perturbation Tasks | Key Limitations Identified |
| --- | --- | --- |
| scGPT | Does not consistently outperform simple additive or linear baselines [2]. | Struggles with predicting strong or atypical perturbation effects; predictions show insufficient variance [2]. |
| scFoundation | Claimed capability for perturbation prediction [2]. | Requires specific gene sets, limiting application to datasets with missing genes; predictions vary less than ground truth [2]. |
| Geneformer | Can be repurposed for perturbation prediction, though this is not its primary design [2]. | Underperforms in zero-shot settings such as batch integration and cell type clustering [1]. |
| UCE & scBERT | Repurposable for perturbation tasks via a linear decoder [2]. | Not originally designed for perturbation prediction, leading to suboptimal performance [2]. |
| General scFMs | Zero-shot embeddings do not provide consistent improvement over baselines, especially under distribution shift [6]. | All models struggle to predict genetic interactions (synergistic/opposite) and often default to predicting "buffering" effects [2]. |

A critical finding from recent studies is that the goal of creating a generalizable representation for predicting the outcome of novel experiments remains largely elusive [2]. Furthermore, in zero-shot settings—where models are used without any task-specific fine-tuning—scFMs can show significant reliability challenges and be outperformed by simpler methods [1].

Why is zero-shot performance particularly important for perturbation research?

Zero-shot evaluation is critical for assessing whether pretraining has endowed the model with a true, transferable understanding of biology, which is essential for exploratory research where labels are unknown or fine-tuning is not feasible [1]. In the context of perturbation research, this capability is paramount for in silico prediction of experiments that have not yet been conducted, a key promise of foundation models.

However, evidence suggests that the masked language model pretraining framework used by many scFMs may not inherently produce cell embeddings that are useful for zero-shot perturbation prediction [1] [2]. This represents a significant limitation for researchers who need to apply models to novel diseases, uncharacterized cell types, or unprecedented combinatorial perturbations where no training data exists.

Experimental Protocols & Benchmarking

What are the standard protocols for benchmarking scFMs on perturbation tasks?

Benchmarking scFMs for perturbation effect prediction involves evaluating their ability to predict gene expression changes after single or double genetic perturbations. The following workflow outlines a standardized protocol adapted from recent rigorous benchmarks [2].

[Workflow diagram: benchmark setup] 1. Data preparation (e.g., Norman et al. CRISPRa data) → 2. Data splitting (62 train / 62 test double perturbations) → 3. Establish baselines ('No Change' model; 'Additive' model summing LFCs) → 4. Fine-tune models on training perturbations → 5. Evaluation (L2 distance on top 1,000 genes) → 6. Interaction analysis (classify as buffering / synergistic / opposite).

Detailed Protocol:

  • Data Preparation: Utilize publicly available perturbation datasets, such as the Norman et al. dataset, which includes transcriptome-wide expression values for 100 single-gene and 124 double-gene perturbations in K562 cells using CRISPR activation [2].
  • Data Splitting: For double perturbation prediction, fine-tune models on all 100 single perturbations and a randomly selected half (e.g., 62) of the double perturbations. The remaining double perturbations are held out for testing. This process should be repeated multiple times (e.g., five) with different random partitions for robustness [2].
  • Establish Baselines: Compare scFMs against deliberately simple baselines. These are crucial for a fair assessment:
    • 'No Change' Baseline: Always predicts the same expression as the control condition.
    • 'Additive' Baseline: For each double perturbation, predicts the sum of the individual logarithmic fold changes (LFCs) from the single perturbations. Neither baseline uses the double perturbation data for training [2].
  • Model Fine-tuning: Follow the authors' recommended procedures to fine-tune the scFMs (e.g., scGPT, scFoundation, GEARS) on the training set.
  • Evaluation Metric: The primary prediction error is often the L2 distance between the predicted and observed expression values for the 1,000 most highly expressed genes. Other metrics like Pearson correlation can provide additional insight [2].
  • Genetic Interaction Analysis: Operationalize genetic interactions by identifying double perturbations where the observed phenotype differs significantly from the additive expectation. Evaluate models on their ability to correctly classify these interactions as buffering, synergistic, or opposite [2].
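The evaluation metric in step 5 can be sketched directly. All expression values below are synthetic, and ranking the "top expressed" genes by control-condition mean is an illustrative assumption about how the gene set is chosen.

```python
import numpy as np

def l2_top_genes(pred, obs, control_mean, k=1000):
    """L2 prediction error restricted to the k most highly expressed genes,
    ranked here by mean expression in the control condition."""
    top = np.argsort(control_mean)[::-1][:k]
    return float(np.linalg.norm(pred[top] - obs[top]))

rng = np.random.default_rng(4)
n_genes = 5000
control_mean = rng.gamma(shape=2.0, scale=1.0, size=n_genes)   # synthetic expression levels
observed = control_mean + rng.normal(scale=0.3, size=n_genes)  # a perturbed profile

no_change_pred = control_mean  # 'No Change' baseline: predict the control condition
model_pred = observed + rng.normal(scale=0.2, size=n_genes)    # hypothetical model output

print(f"no-change L2: {l2_top_genes(no_change_pred, observed, control_mean):.1f}")
print(f"model L2:     {l2_top_genes(model_pred, observed, control_mean):.1f}")
```

Repeating this over the five random train/test partitions and averaging gives the headline numbers that the benchmark tables report.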

What key reagents and computational tools are required?

Table 2: Essential Research Reagents & Solutions for scFM Perturbation Analysis

| Item Name | Function / Description | Example Source / Identifier |
| --- | --- | --- |
| Perturbation Datasets | Provide ground-truth transcriptome data for training and benchmarking models. | Norman et al. data (GEO); Replogle et al. K562/RPE1 CRISPRi data [2]. |
| Reference Datasets | Used for evaluating zero-shot capabilities in tasks like cell type annotation and batch integration. | Pancreas dataset; Tabula Sapiens; PBMC (12k) dataset [4] [1]. |
| Benchmarking Framework | Standardized codebase to ensure fair and reproducible model comparisons. | PertEval-scFM framework [6]. |
| Baseline Models | Simple, non-foundation-model benchmarks (e.g., HVG selection, linear models) essential for performance context. | Highly Variable Genes (HVG); Harmony; scVI; simple additive model [1] [2]. |
| Compute Resources | High-performance computing (HPC) or cloud resources needed for training and fine-tuning large models. | GPUs (e.g., NVIDIA A100/H100) for efficient training [4]. |

Troubleshooting Common Experimental Issues

Our scFM predictions lack variance and fail to capture strong genetic interactions. What could be wrong?

This is a common issue reported in benchmarks, where models like scGPT and GEARS predict expression changes that vary considerably less than the ground-truth data and largely fail to capture correct synergistic interactions [2].

Potential Causes and Solutions:

  • Cause: Inadequate Pretraining Context. The model may not have learned sufficient causal biological relationships from its pretraining corpus.
    • Solution: Consider using or creating models that incorporate additional biological context, such as protein-protein interactions or gene ontology annotations, during pretraining. Alternatively, a linear model using perturbation embeddings pretrained on similar data (e.g., from another cell line) has been shown to outperform foundation models in some cases [2].
  • Cause: Task-Framework Misalignment. The standard masked language modeling objective may not optimally prepare the model for the specific task of predicting out-of-distribution perturbation effects.
    • Solution: Explore alternative fine-tuning strategies or architectures specifically designed for perturbation prediction, such as the cycle-consistent representation alignment framework scREPA [31].
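The under-dispersion symptom described above can be flagged with a one-number diagnostic: the ratio of the predicted standard deviation to the ground-truth standard deviation. All values below are synthetic stand-ins for real model output.

```python
import numpy as np

rng = np.random.default_rng(5)
n_perts, n_genes = 40, 500

# Ground-truth LFCs across perturbations vs a model whose outputs are over-shrunk
# toward the control condition (the failure mode described above).
true_lfc = rng.normal(scale=1.0, size=(n_perts, n_genes))
shrunk_pred = 0.2 * true_lfc + rng.normal(scale=0.05, size=(n_perts, n_genes))

# Variance diagnostic: overall spread of predicted vs observed effects.
ratio = shrunk_pred.std() / true_lfc.std()
print(f"prediction/ground-truth std ratio: {ratio:.2f}")
# A ratio well below 1 flags under-dispersed predictions hugging the control state.
```

A ratio near 1 does not prove the predictions are correct, but a ratio far below 1 is a cheap early warning that the model is defaulting toward "no change".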

How do we choose between a complex scFM and a simpler model for our project?

The decision should be guided by a clear understanding of your project's constraints and goals. The following diagram can help guide this decision.

[Decision flowchart] Start by asking whether zero-shot performance is required. If yes, prioritize simpler baseline models (HVG, scVI, linear models), then weigh interpretability needs and computational limits. If fine-tuning is an option, ask whether your dataset is large and diverse: if yes, explore single-cell foundation models (scGPT, Geneformer); if no, proceed with scFMs only alongside rigorous validation against baselines.

Decision Guide:

  • Prioritize simpler models (HVG, scVI, linear models) if:
    • Your primary task is zero-shot perturbation prediction and you need high reliability [1] [2].
    • You have limited computational resources for fine-tuning [4].
    • Your dataset is small or specific to a single biological context.
  • Consider investing in scFMs if:
    • You have a large, diverse dataset that allows for extensive fine-tuning.
    • Your goal is a multi-task analysis (e.g., joint batch integration, cell type annotation, and perturbation analysis) where scFMs can serve as a versatile, plug-and-play module [4] [26].
    • You can leverage their learned gene embeddings as a potentially useful starting point for a simpler linear prediction model [2].

Future Directions & Model Selection FAQ

Given the limitations, what is the future of scFMs in perturbation biology?

The current limitations highlight a need for more specialized model architectures and higher-quality, broader perturbation datasets [6]. Future directions may include:

  • Developing New Pretraining Objectives: Moving beyond standard masked language modeling to objectives that more explicitly capture causal, predictive relationships in biological networks.
  • Incorporating Multi-modal Data: Integrating foundational knowledge from genomics (e.g., scATAC-seq), proteomics, and published literature to create richer contextual models [26].
  • Focus on Interpretability: Improving methods to interpret the biological relevance of the latent representations learned by scFMs, moving from a "black box" to a tool for biological discovery [4] [26].

Are there any scFMs that consistently outperform others across all tasks?

No. Comprehensive benchmarks conclude that no single scFM consistently outperforms all others across diverse application scenarios [4]. Model performance is highly dependent on the specific task (e.g., batch integration vs. perturbation prediction), dataset size, and biological context.

The most effective strategy is not to search for a single "best" model but to adopt a benchmarking-driven approach. For your specific dataset and perturbation task, run a focused benchmark comparing several promising scFMs (e.g., scGPT, Geneformer) against the simple baselines described in the experimental protocols above. This is the only way to make a data-driven selection tailored to your research needs [4] [2].

Conclusion

The current generation of single-cell foundation models represents a significant step forward in computational biology, yet our analysis underscores that their zero-shot application to perturbation prediction remains fraught with challenges. The key takeaway is that these models, in their raw pretrained state, are not a magic bullet; they often cannot surpass simple baselines and require careful handling. However, the path forward is clear. Success hinges on moving beyond a pure zero-shot paradigm through strategic fine-tuning, incorporating structured biological knowledge, and adhering to rigorous, standardized benchmarking. Future progress will depend on developing more specialized models trained on higher-quality, broader perturbation datasets and creating evaluation frameworks that prioritize biologically meaningful insights over purely technical metrics. For biomedical research, this critical evolution will be essential to truly leverage scFMs for accelerating drug discovery and unraveling complex disease mechanisms.

References