Single-cell foundation models (scFMs), pretrained on millions of cells, promise to revolutionize the in-silico prediction of cellular responses to genetic and drug perturbations. However, rigorous benchmarking reveals significant limitations in their zero-shot capabilities, where models are used without task-specific fine-tuning. This article synthesizes recent evidence showing that zero-shot scFMs often fail to outperform deliberately simple baselines, struggle with distribution shifts, and offer limited improvements for predicting unseen perturbations. We explore the foundational causes of these shortcomings, survey emerging methodological fixes like efficient fine-tuning, provide a framework for troubleshooting model performance, and outline rigorous validation standards. For researchers and drug development professionals, this critical appraisal provides essential guidance for navigating the current landscape of scFMs, enabling more informed and effective application in perturbation biology and therapeutic discovery.
Q1: Why does my zero-shot single-cell foundation model (scFM) underperform on basic cell type clustering compared to established methods?
A: Current benchmarking reveals that in zero-shot settings, scFMs like Geneformer and scGPT can be outperformed in cell type clustering by simpler methods, including the selection of Highly Variable Genes (HVG) or established tools like Harmony and scVI [1]. This is measured by metrics such as the average biological conservation (AvgBIO) score and average silhouette width (ASW) [1]. The underlying issue may be that the masked language model pretraining framework does not inherently produce cell embeddings that are optimal for this specific biological task without further, task-specific fine-tuning [1].
Q2: When predicting genetic perturbation effects, why do complex scFMs fail to beat simple baseline models?
A: Multiple independent studies have found that for predicting transcriptome changes after single or double genetic perturbations, several scFMs (including scGPT and scFoundation) and other deep learning models do not consistently outperform deliberately simple baselines [2] [3]. These baselines include:
- An additive model, which predicts a double perturbation's effect as the sum of the two single-perturbation effects [2].
- A 'no change' model, which simply returns the unperturbed control expression profile [2].
- The perturbed mean, i.e., the average expression profile of all perturbed cells [3].
- Simple linear models trained directly on the expression data [3].
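To make concrete how deliberately simple these baselines are, here is a minimal NumPy sketch (function and variable names are illustrative, not taken from the cited benchmarks):

```python
import numpy as np

def no_change_baseline(control_mean):
    # Predicts that the perturbation has no effect: return the control profile.
    return control_mean

def additive_baseline(control_mean, delta_a, delta_b):
    # Predicts a double perturbation as the sum of the two single-gene effects
    # (deltas are, e.g., log fold changes relative to control).
    return control_mean + delta_a + delta_b

def perturbed_mean_baseline(perturbed_profiles):
    # Predicts the average profile over all perturbed cells, ignoring identity.
    return perturbed_profiles.mean(axis=0)

# Toy example with 5 genes.
control = np.zeros(5)
delta_a = np.array([1.0, 0.0, 0.0, 0.5, 0.0])
delta_b = np.array([0.0, -1.0, 0.0, 0.5, 0.0])
pred = additive_baseline(control, delta_a, delta_b)
print(pred)  # [ 1. -1.  0.  1.  0.]
```

Despite their simplicity, these few lines define the bar that, per the benchmarks above, current scFMs frequently fail to clear.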
Q3: My scFM performs well on batch integration for some datasets but fails on others. What is happening?
A: The performance of scFMs on batch integration is inconsistent. While they may successfully integrate data from different experiments using the same technique, they often struggle to correct for batch effects between different experimental techniques [1]. Quantitative evaluations show that methods like Harmony and scVI frequently outperform scFMs on this task, and in many cases, even simply selecting HVGs can achieve superior batch integration scores [1]. The effectiveness of an scFM can be highly dependent on the specific characteristics of the dataset and the nature of the batch effects.
Q4: Is there a single scFM that consistently outperforms all others across diverse tasks?
A: No. Comprehensive benchmarks indicate that no single scFM consistently outperforms all others across every task [4]. The best model for a given project depends on factors such as dataset size, task complexity, the need for biological interpretability, and available computational resources [4]. Model selection should therefore be a tailored decision based on the specific experimental context and goals.
The tables below summarize key findings from recent benchmark studies, providing a direct comparison between scFMs and simpler baseline methods.
Table 1: Performance Comparison on Cell-Level Tasks (Zero-Shot)
| Task | Top-Performing Methods | Underperforming Methods | Key Metric(s) | Notes |
|---|---|---|---|---|
| Cell Type Clustering | HVG, scVI, Harmony [1] | scGPT, Geneformer [1] | AvgBIO, ASW [1] | scFMs show inconsistent performance across different datasets [1]. |
| Batch Integration | HVG, scVI, Harmony [1] | Geneformer, scGPT [1] | Batch mixing scores, PCR [1] | Geneformer often increases batch effect variance compared to input data [1]. |
Table 2: Performance Comparison on Perturbation Prediction Tasks
| Task | Simple Baseline Models | Complex Models Benchmarked | Key Finding | Reference |
|---|---|---|---|---|
| Double Perturbation Prediction | Additive Model, 'No Change' Model [2] | GEARS, CPA, scGPT, scFoundation, scBERT, Geneformer, UCE* [2] | "All models had a prediction error substantially higher than the additive baseline." [2] | [2] |
| Unseen Single Perturbation Prediction | Perturbed Mean, Linear Model [3] | CPA, GEARS, scGPT [3] | "Simple baselines performed comparatively or outperformed state-of-the-art methods." [3] | [3] |
| Unseen Combinatorial Perturbation Prediction | Matching Mean Baseline [3] | GEARS, scGPT [3] | The matching mean baseline "outperformed all other baselines by a considerable margin." [3] | [3] |
This protocol is adapted from evaluations of scFM zero-shot capabilities [1].
This protocol is based on benchmarks comparing scFMs to simple baselines [2] [3].
Table 3: Essential Computational Tools for scFM Benchmarking
| Item Name | Function / Application | Key Insight from Benchmarking |
|---|---|---|
| Highly Variable Genes (HVG) | A baseline method for feature selection prior to clustering or integration. | Surprisingly robust; often outperforms or matches scFMs in zero-shot cell type clustering and batch integration tasks [1]. |
| Harmony | Algorithm for integrating single-cell data across multiple batches or experiments. | A strong, established baseline for batch correction that frequently outperforms zero-shot scFM embeddings [1]. |
| scVI | A probabilistic generative model for scRNA-seq data analysis, including integration. | Consistently performs well on cell-level tasks and serves as a powerful benchmark against which to compare new scFMs [4] [1]. |
| 'Perturbed Mean' Baseline | A simple model that predicts the average expression profile of all perturbed cells. | Crucial for perturbation benchmarks; reveals that complex models may not capture much beyond this average effect for unseen perturbations [2] [3]. |
| 'Additive' Baseline | A model that predicts double perturbation effects as the sum of single perturbation effects. | Essential for evaluating combinatorial perturbation prediction; often outperforms specialized deep learning models [2]. |
| Systema Framework | An evaluation framework designed to control for systematic variation in perturbation data. | Helps distinguish models that capture true perturbation-specific biology from those that merely learn dataset-wide biases [3]. |
FAQ 1: What is the "Distribution Shift Problem" in the context of single-cell perturbation prediction?
The distribution shift problem refers to the significant performance deterioration that single-cell foundation models (scFMs) exhibit when they encounter strong or atypical genetic perturbations that differ from the data they were trained on. In a zero-shot setting, these models struggle to generalize to these out-of-distribution examples, often failing to accurately predict the transcriptional outcomes of such perturbations [5] [6].
FAQ 2: Why do current single-cell foundation models (scFMs) fail on atypical perturbations?
Benchmarking studies indicate that current-generation scFMs primarily capture systematic variation—the consistent transcriptional differences between pools of perturbed and control cells caused by selection biases or biological confounders—rather than genuine, perturbation-specific effects. When presented with an atypical perturbation that does not share these common systematic patterns, the models lack the specific biological knowledge to make an accurate prediction [7].
FAQ 3: Are there any standardized benchmarks to evaluate this issue?
Yes, the PertEval-scFM framework is a standardized benchmark specifically designed to evaluate models, including their performance on distribution shifts. It assesses whether zero-shot scFM embeddings genuinely enhance perturbation effect prediction compared to simpler baseline models [5] [6].
FAQ 4: What is a key pitfall in evaluating my own perturbation prediction model?
A common pitfall is relying on standard reference-based metrics (like Pearson correlation on differential expression) without accounting for systematic variation. A model can achieve a high score by simply learning the average difference between all perturbed and control cells, which does not reflect its ability to predict the unique effect of a specific, unseen perturbation. The Systema framework is a new evaluation method designed to mitigate this bias [7].
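This pitfall can be demonstrated with a small simulation (entirely illustrative; the effect sizes and variable names are assumptions, not from the Systema paper): a "model" that always outputs the shared systematic shift, ignoring perturbation identity, still achieves a high per-perturbation Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_perts = 1000, 50

# Shared systematic shift between perturbed and control pools
# (e.g., selection bias or a cell-cycle confounder).
systematic = rng.normal(0.0, 1.0, n_genes)
# True per-perturbation effects = systematic shift + a smaller specific signal.
specific = rng.normal(0.0, 0.3, (n_perts, n_genes))
true_deltas = systematic + specific

# A "model" that ignores perturbation identity and predicts the mean shift.
mean_only_prediction = true_deltas.mean(axis=0)

corrs = [np.corrcoef(mean_only_prediction, d)[0, 1] for d in true_deltas]
print(f"mean Pearson r of identity-blind model: {np.mean(corrs):.2f}")
```

The identity-blind prediction correlates strongly with every perturbation's observed effect, which is exactly why reference-based correlation metrics alone cannot certify that a model captures perturbation-specific biology.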
FAQ 5: What are some emerging solutions to improve generalization?
Emerging approaches focus on better integration of biological knowledge and representation learning. For instance:
- SynthPert fine-tunes language models on synthetic, quality-filtered chain-of-thought explanations to distill biological reasoning into a smaller model [8].
- scREPA aligns VAE latent embeddings with biologically meaningful representations from pre-trained scFMs, improving generalization to unseen conditions and noisy data [9].
Problem: My model's performance drops significantly on unseen or strong genetic perturbations.
Diagnosis: This is a classic symptom of the distribution shift problem. The model is likely overfitting to the systematic variation present in its training data and cannot extrapolate to novel scenarios.
Solution - Implement Rigorous Evaluation: Follow the protocol below to diagnose whether your model is capturing true biological signals or just systematic bias.
Experimental Protocol: Isolating Perturbation-Specific Effects with Systema
Objective: To evaluate a model's ability to predict perturbation-specific effects, free from the confounding influence of systematic variation.
Materials:
Methodology:
Interpretation of Results:
The following diagram illustrates this diagnostic experimental workflow.
Problem: My model lacks biological reasoning for its predictions, hindering trust and utility.
Diagnosis: The model has learned statistical associations but not the underlying mechanistic biology, making it an unreliable tool for hypothesis generation.
Solution - Incorporate Synthetic Biological Reasoning: Use a knowledge distillation approach to infuse biological reasoning into a smaller, more efficient model, as demonstrated by SynthPert [8].
Experimental Protocol: Knowledge Distillation with Synthetic Reasoning Traces
Objective: To enhance a model's biological reasoning capabilities for perturbation prediction through supervised fine-tuning on synthetic chain-of-thought explanations.
Materials:
Methodology:
Interpretation of Results:
The workflow for this solution is shown below.
Table 1: Benchmarking scFMs against Simple Baselines for Zero-Shot Perturbation Prediction [5] [6] [7]
| Model / Baseline | Core Principle | Performance on Unseen Perturbations | Performance under Distribution Shift | Key Limitation |
|---|---|---|---|---|
| Zero-shot scFMs | Contextualized embeddings from models pre-trained on large scRNA-seq atlases. | Limited improvement over baselines [5] [6]. | Significant performance deterioration, especially on strong/atypical perturbations [5] [6]. | Captures systematic variation rather than perturbation-specific effects [7]. |
| Perturbed Mean | Non-parametric baseline; predicts the average expression of all perturbed cells. | Surprisingly competitive or superior for unseen one-gene perturbations [7]. | Robust, as it represents the average systematic effect. | Cannot predict any perturbation-specific details; only the average treatment effect. |
| Matching Mean | Non-parametric baseline; for combo perturbation X+Y, averages the mean profiles of X and Y. | Outperforms complex models for unseen two-gene perturbations [7]. | Robust for combinations of seen single-gene perturbations. | Relies on having seen the constituent single-gene perturbations. |
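The matching-mean baseline from the table above can be sketched in a few lines (illustrative NumPy; the benchmark's actual implementation may differ, and the gene names below are arbitrary examples): for an unseen combination X+Y, it simply averages the mean post-perturbation profiles observed for X alone and Y alone.

```python
import numpy as np

def matching_mean(mean_profiles, gene_x, gene_y):
    """Predict an unseen X+Y combination as the average of the two
    observed single-gene mean expression profiles."""
    return 0.5 * (mean_profiles[gene_x] + mean_profiles[gene_y])

# Toy example: mean expression profiles (3 genes) from seen single perturbations.
profiles = {
    "KLF1":  np.array([2.0, 0.0, 1.0]),
    "GATA1": np.array([0.0, 2.0, 1.0]),
}
print(matching_mean(profiles, "KLF1", "GATA1"))  # [1. 1. 1.]
```

Its obvious limitation, noted in the table, is that both constituent single-gene perturbations must have been observed.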
Table 2: Emerging Methods for Improved Generalization [8] [9]
| Model | Core Methodology | Reported Advantage | Applicability |
|---|---|---|---|
| SynthPert | Supervised fine-tuning of LLMs on synthetic, quality-filtered chain-of-thought explanations. | Achieves 87% accuracy on unseen RPE1 cells; state-of-the-art on PerturbQA [8]. | Enhances biological reasoning and cross-cell-type generalization. |
| scREPA | Aligns VAE latent embeddings with biologically meaningful representations from pre-trained scFMs using cycle-consistent alignment. | Outperforms existing methods in predicting DEGs and whole-transcriptome responses; generalizes well to unseen conditions and noisy data [9]. | Improves representation quality for robust prediction under data limitations. |
Table 3: Essential Resources for scFM Perturbation Prediction Research
| Resource Name | Type | Function in Research | Example from Search Results |
|---|---|---|---|
| PertEval-scFM | Software Benchmark | Standardized framework to evaluate and compare the performance of single-cell foundation models for perturbation prediction in a zero-shot setting [5] [6]. | https://github.com/aaronwtr/PertEval [5] |
| Systema | Software Framework / Evaluation Metric | An evaluation framework that mitigates the confounding effects of systematic variation, providing a clearer readout of a model's ability to capture perturbation-specific biology [7]. | https://github.com/mlbio-epfl/systema [7] |
| PerturbQA | Dataset & Benchmark | A benchmark that reformulates perturbation experiments into natural language tuples, enabling the evaluation of LLM-based biological reasoning [8]. | Used as the primary evaluation dataset in SynthPert [8]. |
| Adamson & Norman Datasets | Experimental Data | Key single-cell perturbation datasets often used for training and benchmarking. They target specific biological processes but are known to contain significant systematic variation [7]. | Used in the Systema benchmark to demonstrate systematic variation [7]. |
| Replogle (RPE1) Dataset | Experimental Data | A large-scale, genome-wide perturbation screen used to study model generalization and artifacts like cell-cycle arrest induced by perturbations [7]. | Used in the Systema benchmark to demonstrate cell-cycle systematic bias [7]. |
Q1: What does "zero-shot" performance mean for a single-cell foundation model (scFM), and why is it important? Zero-shot evaluation tests a foundation model's capabilities without any further task-specific training or fine-tuning: the model's pre-trained internal representation, or "embedding," is used directly for downstream analysis. This matters for exploratory research, where predefined labels often don't exist and fine-tuning is therefore not an option, such as in discovery settings where the biological outcomes are unknown [1].
Q2: Our team is getting poor results using scGPT and Geneformer for zero-shot perturbation prediction. Are we doing something wrong? Not necessarily. Benchmarking studies have consistently found that these models in a zero-shot setting do not outperform, and are sometimes outperformed by, deliberately simple baseline methods. This appears to be a fundamental limitation of current model architectures and pretraining, not user error [6] [2].
Q3: What are the main types of failures when predicting genetic interactions? Models struggle with several specific scenarios:
- Synergistic interactions: models often default to predicting no interaction between gene pairs [2].
- Strong or atypical perturbations: performance deteriorates under distribution shift from the training data [6].
- Unseen perturbations: models cannot reliably extrapolate to genes absent from their fine-tuning data [2].
Q4: Is there a way to improve the accuracy of these models for our perturbation experiments? Yes, recent research suggests moving from an "open-loop" to a "closed-loop" framework can significantly improve performance. This involves iteratively fine-tuning the foundation model with experimental perturbation data (e.g., from Perturb-seq). This approach has been shown to triple the Positive Predictive Value (PPV) of predictions [10].
Problem: When using scGPT or Geneformer embeddings for tasks like cell type identification or removing batch effects without fine-tuning, the performance is inconsistent and worse than established methods.
Investigation & Diagnosis:
Solution: For zero-shot tasks, rely on proven, simpler methods as your primary baseline.
Problem: The model's predictions for gene expression changes after single or double genetic perturbations do not match experimental validation data.
Investigation & Diagnosis:
Solution:
Problem: The model cannot accurately extrapolate to predict the effects of perturbing a gene that was not present in its fine-tuning dataset.
Investigation & Diagnosis: This is a known weakness. Claims that foundation models can inherently generalize to unseen perturbations through pretraining are not yet fully supported by benchmarks [2].
Solution:
This table summarizes the performance of scFM embeddings compared to established methods on common tasks, as measured in independent evaluations. ASW (Average Silhouette Width) and AvgBIO score measure how well cell types are separated, while Batch Score measures how well technical batch effects are removed [1].
| Task | Metric | HVG | Harmony | scVI | scGPT | Geneformer |
|---|---|---|---|---|---|---|
| Cell Type Clustering | AvgBIO / ASW | Outperforms | Outperforms | Outperforms | Inconsistent | Underperforms |
| Batch Integration | Batch Score | Best | Good | Good | Moderate | Underperforms |
Key finding: A simple, established method often provides the most robust zero-shot performance for these tasks.
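The ASW metric used in this comparison is simple to compute from embeddings and labels. The sketch below is a minimal pure-NumPy version of the standard silhouette definition (benchmarks typically use scikit-learn's `silhouette_score` or the `scib.metrics` package instead):

```python
import numpy as np

def average_silhouette_width(X, labels):
    """Mean silhouette over all cells: s = (b - a) / max(a, b), where a is
    the mean distance to cells in the same cluster and b is the smallest
    mean distance to any other cluster."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i, lab in enumerate(labels):
        same = (labels == lab)
        same[i] = False
        if not same.any():          # singleton cluster: silhouette is 0
            scores.append(0.0)
            continue
        a = dists[i, same].mean()
        b = min(dists[i, labels == other].mean()
                for other in set(labels.tolist()) if other != lab)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated "cell types" yield an ASW close to 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
print(round(average_silhouette_width(X, np.array([0, 0, 1, 1])), 2))  # 0.99
```

An ASW near 1 indicates tight, well-separated clusters; values near 0 indicate overlapping clusters, which is the failure mode reported for some zero-shot scFM embeddings.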
This table compares the performance of various models against simple baselines for predicting gene expression changes after genetic perturbations. L2 Distance measures the error in predicting expression values, while AUROC (Area Under the Receiver Operating Characteristic Curve) measures the ability to classify genetic interactions correctly [2] [10].
| Model / Method | L2 Distance (vs. Additive) | AUROC | Notes |
|---|---|---|---|
| Additive Model | (Baseline) | N/A | Sums individual gene effects. A surprisingly strong benchmark [2]. |
| No Change Model | Higher | 0.63* | Predicts control expression. Foundation models can perform similarly [2]. |
| GEARS, scGPT, scFoundation | Higher | <0.63 | Do not consistently outperform additive/no-change baselines [2]. |
| Open-loop ISP (Geneformer) | N/A | 0.63 | PPV of 3%, similar to differential expression [10]. |
| Closed-loop ISP (Geneformer) | N/A | 0.86 | PPV of 9% (3x improvement) with only ~20 perturbation examples [10]. |
Key finding: Deliberately simple models are not outperformed by current complex scFMs for this task.
Objective: Evaluate the quality of scFM cell embeddings for cell type clustering and batch integration without any fine-tuning.
Methodology:
Quantify the removal of batch effects while preserving biological variance using established integration metrics (e.g., the scib.metrics package).

Objective: Dramatically improve the accuracy of in-silico perturbation predictions by incorporating experimental data.
Methodology:
This table lists key reagents and tools required for experimental validation of computational predictions, a critical step in the closed-loop framework.
| Item | Function / Description | Example Use Case |
|---|---|---|
| CRISPRa/i System | A system for gene activation (a) or interference (i) to perturb gene function. | Genetically perturbing target genes in primary human T cells to study activation [10]. |
| Perturb-seq Protocol | A single-cell RNA sequencing method that captures the transcriptomic effects of genetic perturbations in pooled screens. | Generating experimental data for fine-tuning foundation models in a closed-loop [10]. |
| ATAC-seq Kit | Assay for Transposase-Accessible Chromatin to map genome-wide chromatin accessibility. | Providing complementary epigenetic data to understand regulatory mechanisms [11]. |
| ChIPmentation Kit | A technology that combines chromatin immunoprecipitation (ChIP) with tagmentation for efficient library prep. | Mapping histone modifications or transcription factor binding sites in low-input samples [12]. |
| Flow Cytometry Assays | Measures protein expression and cytokine production (e.g., IL-2, IFN-γ) at the single-cell level. | Providing orthogonal, non-transcriptomic validation of perturbation effects on cell function [10]. |
Q1: What does "zero-shot" evaluation mean for single-cell foundation models (scFMs), and why is it critical for my research?
A1: Zero-shot evaluation tests a foundation model's performance on a new task or dataset without any additional training (fine-tuning). This is critical for exploratory biology because, in many discovery settings, the biological labels or outcomes you are looking for are unknown, making fine-tuning impossible. A model's zero-shot capability demonstrates its true generalizability and the fundamental biological understanding it gained during pre-training [1].
Q2: My zero-shot scFM embeddings are performing poorly in cell type clustering. What could be the cause?
A2: Recent benchmarks have identified that scFMs like scGPT and Geneformer can underperform simpler methods like Highly Variable Gene (HVG) selection or established integration tools like Harmony and scVI in zero-shot cell type clustering [1]. This suggests that the masked language model pre-training objective used by many scFMs may not automatically produce high-quality cell embeddings for every downstream task without fine-tuning. If you encounter this, consider using a simpler baseline method as a benchmark for your specific dataset [1].
Q3: Can current scFMs accurately predict genetic interaction effects in a zero-shot setting?
A3: Current evidence suggests they cannot. When predicting effects of double genetic perturbations, foundation models and other deep learning models have failed to outperform a deliberately simple "additive" baseline, which just sums the effects of single perturbations. Furthermore, these models struggle to correctly predict synergistic genetic interactions, often defaulting to predicting no interaction [2].
Q4: Are there any scFMs that show promising zero-shot capabilities?
A4: Some newer models are being designed with a stronger focus on zero-shot performance. For example, scShift is a deep identifiable model that, when scaled up, has demonstrated remarkable zero-shot capabilities in characterizing cell types and biological states while overcoming batch effects across datasets [13]. This indicates that model architecture and training objectives are key factors for successful zero-shot application.
Symptoms: When visualizing your scFM cell embeddings, the data clusters strongly by batch or dataset source instead of by biological cell type.
Diagnosis: The model has failed to learn batch-invariant representations of cells in its zero-shot setting.
Solutions:
Symptoms: Your model's predictions for gene expression changes after a perturbation are inaccurate, particularly for double-gene perturbations, and are worse than a simple additive model.
Diagnosis: The model has not learned the underlying biological rules that govern genetic interactions.
Solutions:
Table 1: Zero-shot Performance Comparison on Cell Type Clustering (AvgBIO Score) [1]
| Model / Method | Pancreas Dataset | PBMC (12k) Dataset | Tabula Sapiens Dataset | Immune Dataset |
|---|---|---|---|---|
| HVG (Baseline) | 0.671 | 0.620 | 0.672 | 0.625 |
| scVI (Baseline) | 0.659 | 0.621 | 0.653 | 0.581 |
| Harmony (Baseline) | 0.622 | 0.615 | 0.579 | 0.549 |
| scGPT | 0.581 | 0.649 | 0.651 | 0.545 |
| Geneformer | 0.551 | 0.556 | 0.502 | 0.508 |
A higher AvgBIO score indicates better cell type separation. HVG often outperforms or matches foundation models.
Table 2: Performance on Double Genetic Perturbation Prediction (L2 Distance) [2]
| Model / Method | Prediction Error (L2 Distance) | Outperforms Additive Baseline? |
|---|---|---|
| Additive Baseline | ~0.28 | N/A |
| No Change Baseline | ~0.40 | No |
| GEARS | ~0.38 | No |
| scGPT | ~0.42 | No |
| Geneformer* | ~0.45 | No |
| scBERT* | ~0.43 | No |
Models marked with * were repurposed with a linear decoder. Lower L2 distance is better. No model outperformed the simple additive baseline.
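The two metrics used throughout these benchmarks are straightforward to compute. The following is a minimal NumPy sketch (the benchmarks' exact gene-selection and normalization choices are not reproduced here; the top-k selection by absolute observed change is an assumption):

```python
import numpy as np

def l2_distance(pred, true, top_k=1000):
    """L2 distance between predicted and observed expression changes,
    restricted to the top_k genes with the largest absolute observed
    change (benchmark papers may select genes differently)."""
    top = np.argsort(-np.abs(true))[:top_k]
    return float(np.linalg.norm(pred[top] - true[top]))

def auroc(scores, labels):
    """Rank-based AUROC: the probability that a randomly chosen positive
    outscores a randomly chosen negative (ties count as 0.5)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))

print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # → 1.0
```

Lower L2 distance means more accurate expression prediction; AUROC of 0.5 corresponds to random interaction classification, which is why values near 0.63 in the table above represent only a modest signal.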
Objective: To evaluate the quality of cell embeddings generated by an scFM for separating known cell types without fine-tuning.
Materials: A labeled single-cell dataset with known cell types (e.g., a subset of Tabula Sapiens). The pre-trained scFM model (e.g., scGPT, Geneformer). Baseline methods (HVG, scVI, Harmony).
Methodology:
Objective: To test an scFM's ability to predict transcriptome-wide gene expression changes caused by single or double genetic perturbations.
Materials: A perturbation dataset (e.g., Norman et al. or Replogle et al. data). The scFM (e.g., scGPT, scFoundation). Baselines (Additive model, No-change model, simple linear model).
Methodology:
Table 3: Essential Resources for Zero-Shot scFM Evaluation
| Item | Function in Evaluation |
|---|---|
| Pre-trained Model Weights (e.g., for scGPT, Geneformer) | Provides the foundational model to be tested in a zero-shot context without further training [1] [2]. |
| Benchmarking Datasets (e.g., Norman et al. perturbation data, Tabula Sapiens) | Serves as the standardized, ground-truthed test bed for evaluating model performance on specific tasks like perturbation prediction or cell type identification [1] [2]. |
| Simple Baseline Models (e.g., Additive model, HVG selection, linear models) | Critical controls to determine if the complexity of an scFM provides any tangible benefit over simple, established methods [1] [2]. |
| Quantitative Metrics (e.g., L2 distance, AvgBIO score, ASW) | Provides objective, numerical measures of model performance for tasks like prediction accuracy and cluster quality, enabling direct comparison between models [1] [2]. |
| Integration Tools (e.g., Harmony, scVI) | Established methods for comparison against scFMs for tasks like batch correction and dimensionality reduction [1]. |
Q1: What is Parameter-Efficient Fine-Tuning (PEFT) and why is it important for large language models (LLMs)?
PEFT refers to a set of techniques that adapt a large pre-trained model to a new task by training only a small number of parameters, rather than the entire model. This is crucial because LLMs can have billions of parameters, making full fine-tuning computationally expensive, time-consuming, and prone to overfitting, especially on smaller datasets. PEFT methods, such as adapters, achieve performance comparable to full fine-tuning while dramatically reducing computational costs and storage requirements [14] [15].
Q2: How do adapter layers work, and where are they inserted in a transformer model?
Adapters are small, bottleneck-shaped neural network modules inserted into the layers of a pre-trained transformer model. A typical adapter consists of two fully connected layers with a non-linear activation in between. The first layer projects the input down to a lower dimension (the bottleneck), and the second layer projects it back up to the original input dimension [14] [15]. In the original adapter method proposed by Houlsby et al. (2019), two adapter layers are inserted into each transformer block: one after the multi-head attention module and one after the feed-forward network [14].
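The bottleneck design described above can be sketched in a few lines of NumPy (an illustrative forward pass only, not the Houlsby et al. implementation; the near-zero initialization and dimensions are common conventions, assumed here):

```python
import numpy as np

class Adapter:
    """Bottleneck adapter: down-project, non-linearity, up-project, plus a
    residual connection so the module starts out close to the identity."""
    def __init__(self, hidden_dim, bottleneck_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1e-3  # near-zero init preserves pretrained behavior at start
        self.W_down = rng.normal(0, scale, (hidden_dim, bottleneck_dim))
        self.b_down = np.zeros(bottleneck_dim)
        self.W_up = rng.normal(0, scale, (bottleneck_dim, hidden_dim))
        self.b_up = np.zeros(hidden_dim)

    def __call__(self, x):
        h = np.maximum(0.0, x @ self.W_down + self.b_down)  # ReLU bottleneck
        return x + (h @ self.W_up + self.b_up)              # residual add

adapter = Adapter(hidden_dim=768, bottleneck_dim=32)
x = np.random.default_rng(1).normal(size=(4, 768))  # 4 token embeddings
out = adapter(x)
print(out.shape, np.allclose(out, x, atol=1e-2))  # (4, 768) True
```

Because of the residual connection and small initialization, the freshly inserted adapter barely perturbs the frozen model's outputs; training then shapes only these small matrices.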
Q3: What are the primary advantages of using adapters over full fine-tuning?
Adapters train only a small fraction of the model's parameters, which sharply reduces compute, memory, and storage costs while achieving performance comparable to full fine-tuning [14] [15]. Because the base model stays frozen, adapters are also less prone to overfitting on small datasets, and a single pre-trained model can be shared across many tasks by swapping in lightweight, task-specific adapter modules [14].
Q4: In the context of single-cell biology, what is a key limitation of foundation models that PEFT could help address?
Recent rigorous benchmarks have revealed that single-cell foundation models (scFMs), like scGPT and Geneformer, often fail to outperform simple baseline models in zero-shot settings for predicting genetic perturbation effects [2] [1] [5]. This means that using these models "out-of-the-box" without any further training yields unreliable results. PEFT, through methods like adapter tuning, provides a pathway to specialize these general scFMs on specific, high-quality perturbation datasets, potentially bridging this performance gap without the cost of full fine-tuning.
Q5: How does the parameter efficiency of adapters compare to simply fine-tuning the top layers of a model?
Adapters can achieve superior performance with a comparable or even smaller number of trained parameters. For example, a BERT model trained with adapters matched the performance of a fully fine-tuned model while only training 3.6% of the parameters. In a direct experiment with a DistilBERT model, fine-tuning adapter layers outperformed fine-tuning only the top two layers on a sentiment classification task, despite using a similar number of parameters (599,424 for adapters vs. 592,130 for the top layers) [14].
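The 599,424 figure quoted above can be reconstructed from the adapter geometry, assuming DistilBERT's 6 transformer blocks with hidden size 768, two adapters per block (as in the Houlsby et al. layout), and a bottleneck of 32:

```python
hidden, bottleneck, blocks, adapters_per_block = 768, 32, 6, 2

down = hidden * bottleneck + bottleneck   # down-projection weights + bias
up = bottleneck * hidden + hidden         # up-projection weights + bias
per_adapter = down + up                   # 49,952 parameters per adapter

total = per_adapter * adapters_per_block * blocks
print(total)  # 599424 — matches the trainable-parameter count quoted above
```

This arithmetic also shows why the bottleneck dimension is the main efficiency knob: trainable parameters scale linearly with it.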
Problem: Your model, after adapter tuning, is not achieving the expected performance on the downstream task.
Potential Causes and Solutions:
Problem: Your GPU memory usage is still high even though you are using adapters.
Potential Causes and Solutions:
This protocol outlines the steps to insert and train adapter layers in a transformer-based model like DistilBERT, based on the experiments by Sebastian Raschka [14].
Materials:
A pre-trained transformer model (e.g., distilbert-base-uncased from Hugging Face).

Methodology:
This protocol describes how to compare different fine-tuning strategies, as performed in [14].
Methodology:
The table below summarizes results from a sentiment classification task using a DistilBERT model, comparing different fine-tuning strategies [14].
Table 1: Comparison of Fine-Tuning Methods on DistilBERT
| Fine-Tuning Method | Trainable Parameters | Test Accuracy | Training Time (min) |
|---|---|---|---|
| Top Layers Only | 592,130 | 86.4% | 2.89 |
| Adapters (Bottleneck=32) | 599,424 | 88.4% | 5.69 |
| Full Fine-Tuning | ~66.9 million | 93.0% | 7.12 |
Table 2: Essential Components for Adapter-based Fine-Tuning Experiments
| Item | Function | Example / Note |
|---|---|---|
| Pre-trained LLM | The foundation model that provides general language or biological knowledge. | Models like DistilBERT, LLaMA, or single-cell models like scGPT [14] [1]. |
| Task-Specific Dataset | The labeled data used to adapt the model to a new domain or task. | For scFMs, this would be a high-quality dataset of genetic perturbations [2]. |
| Adapter Modules | The small, trainable networks inserted into the base model. | Bottleneck architecture with a configurable hidden dimension [14]. |
| Deep Learning Framework | Software used to implement and train the model. | PyTorch or TensorFlow with the Hugging Face Transformers library. |
| GPU Acceleration | Hardware to handle the computational load of training and inference. | Consumer GPUs (e.g., 16GB T4) are often sufficient for adapter tuning [15]. |
Single-cell foundation models (scFMs), such as Geneformer and scGPT, are pre-trained on massive single-cell transcriptomics datasets with the goal of learning universal biological patterns. A primary application is in silico perturbation (ISP) prediction, where these models forecast how a cell's transcriptome changes in response to a genetic intervention. In discovery research where labels are unknown, models must often operate in a zero-shot setting without task-specific fine-tuning.
Recent rigorous benchmarking, however, has revealed a significant limitation: the zero-shot performance of these scFMs for perturbation prediction frequently fails to outperform deliberately simple baselines [1] [2] [6]. This technical support guide addresses this performance gap by providing actionable strategies for incorporating external biological knowledge to enhance prediction reliability.
This is a commonly reported issue. Quantitative benchmarks have demonstrated that even state-of-the-art scFMs do not consistently outperform simple models like an "additive" baseline (summing individual logarithmic fold changes) or predicting no change from the control condition [2].
Root Cause Analysis:
Solution: Implement a "Closed-Loop" Fine-Tuning Framework.
Evidence: In a T-cell activation study, this approach increased the positive predictive value (PPV) of ISP three-fold, from 3% to 9%, while also improving sensitivity and specificity [10].
This scenario is challenging due to sample scarcity, but external knowledge can be leveraged.
Root Cause: scFMs require sufficient contextual data to make meaningful predictions. For rare diseases, the model may not have encountered enough relevant patterns during pre-training.
Solution: Utilize Engineered Cell Models and Cross-Validation.
Example Workflow for a Rare Hematologic Disorder:
This is a known weakness of current scFMs, as they tend to be biased towards predicting minimal changes [6].
Root Cause: The models may be averaging over possible outcomes or are not trained on sufficient examples of strong genetic interactions.
Solution: Integrate Lineage-Specific Gene Embeddings and Prioritize Data Quality.
It is crucial to verify that the model's latent space captures biologically meaningful relationships.
Root Cause: Without validation, it's unclear if the model has learned relevant biology or just technical artifacts.
Solution: Use Ontology-Informed Metrics.
Table 1: Benchmarking scFMs against simple baselines for double perturbation prediction. Prediction error is measured as L2 distance on top 1,000 genes (lower is better). Adapted from [2].
| Model / Baseline | Prediction Error (L2) | Outperforms Additive Baseline? |
|---|---|---|
| Additive Model (Simple Baseline) | ~1.5 | (Baseline) |
| No Change Model (Simple Baseline) | ~4.5 | No |
| scGPT | ~4.5 | No |
| Geneformer* | ~4.2 | No |
| scBERT* | ~4.5 | No |
| UCE* | ~4.5 | No |
| GEARS | ~3.8 | No |
| scFoundation | ~3.2 | No |
Note: Models marked with * were repurposed with a linear decoder for this task.
Table 2: Impact of "closed-loop" fine-tuning on perturbation prediction accuracy for T-cell activation. PPV: Positive Predictive Value; NPV: Negative Predictive Value. Data from [10].
| Fine-Tuning Approach | PPV | NPV | Sensitivity | Specificity |
|---|---|---|---|---|
| Open-Loop (Standard) ISP | 3% | 98% | 48% | 60% |
| Differential Expression | 3% | 78% | 40% | 50% |
| Closed-Loop ISP | 9% | 99% | 76% | 81% |
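The four metrics in Table 2 all derive from a single confusion matrix of predicted versus experimentally confirmed hits. The counts below are illustrative (chosen to roughly reproduce the closed-loop row, not taken from [10]):

```python
# Hypothetical confusion-matrix counts from validating in-silico
# perturbation calls against experimental ground truth.
tp, fp, tn, fn = 19, 190, 810, 6

ppv = tp / (tp + fp)          # positive predictive value (precision)
npv = tn / (tn + fn)          # negative predictive value
sensitivity = tp / (tp + fn)  # fraction of true hits recovered
specificity = tn / (tn + fp)  # fraction of true non-hits rejected
```

Note how a low PPV can coexist with a high NPV when true hits are rare, which is why PPV is the more informative metric for prioritizing follow-up experiments.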
This methodology details how to incorporate experimental perturbation data to improve a pre-trained scFM [10].
This protocol outlines a strategy to overcome data scarcity in rare disease research [10].
Table 3: Essential resources for enhancing scFM perturbation predictions.
| Reagent / Resource | Function in Context | Example Use Case |
|---|---|---|
| Geneformer-30M-12L | A pre-trained scFM based on the Transformer architecture. Can be fine-tuned for specific tasks. | Base model for closed-loop fine-tuning in T-cell activation and rare disease modeling [10]. |
| Perturb-seq Data | Single-cell RNA sequencing data from genetic perturbation screens. Provides ground-truth data on transcriptional outcomes. | Incorporated during fine-tuning to teach the model the causal links between gene perturbation and cell state [10]. |
| Engineered Cell Models | In vitro models of disease created via CRISPR/Cas9 editing. Bypasses the need for large numbers of patient samples. | Used to generate abundant, relevant transcriptomic data for rare diseases like RUNX1-FPD [10]. |
| Cell Ontologies | Structured, controlled vocabularies for cell types. Define the hierarchical relationships between different cell classes. | Used to compute biology-aware validation metrics like scGraph-OntoRWR and LCAD [4]. |
| Linear Model with Embeddings | A simple, interpretable baseline model that uses pre-trained gene/perturbation vectors. | Serves as a strong benchmark; can outperform complex scFMs in predicting unseen perturbations [2]. |
| RUNX1-FPD Model | A specific engineered model for RUNX1-familial platelet disorder using human HSCs. | Used to identify therapeutic targets (e.g., mTOR, CD74-MIF axis) via ISP [10]. |
Q1: My zero-shot perturbation predictions are outperformed by a simple "no change" baseline. What could be wrong? This is a known limitation identified in recent benchmarks [2]. The "no change" baseline, which always predicts expression identical to the control condition, and the "additive" baseline, which sums individual logarithmic fold changes, have been found to be highly competitive. Current foundation models often struggle to learn representations that generalize better than these simplistic assumptions for predicting unseen perturbation effects [2] [6].
Q2: How can I improve my model's prediction of genetic interactions from double perturbations? Benchmarks reveal that models frequently misclassify interaction types, often predicting "buffering" interactions but rarely correctly identifying "synergistic" or "opposite" interactions [2]. If your model shows this behavior, it may not be capturing the underlying biological complexity. Consider enriching your training data with confirmed interaction examples or exploring alternative model architectures that move beyond current foundation model limitations.
Q3: Can pretrained gene embeddings from large models enhance prediction for unseen single-gene perturbations? Evidence suggests limited benefits. A linear model using embeddings from scFoundation or scGPT did not consistently outperform a linear model with embeddings derived directly from the training data [2]. The most effective strategy identified was pretraining the perturbation embedding matrix (P) on existing large-scale perturbation data (e.g., from a different cell line), which provided more predictive power than atlas-scale single-cell pretraining [2].
Q4: My model works well on the training data but fails to generalize. What steps should I take? This indicates poor out-of-distribution performance, a common challenge. First, implement the simple linear baseline and "mean prediction" baseline to quantify the performance gap [2]. Ensure your training data encompasses a wide range of perturbation strengths and types, as models tend to struggle with strong or atypical effects [6]. Also, verify that your dataset does not suffer from the biases often present in public drug combination databases [16].
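The "mean prediction" baseline recommended above takes a few lines to set up. A minimal numpy sketch on synthetic data (the matrices and the stand-in model output are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_train_perts = 2000, 40

# Hypothetical training matrix of log fold changes (genes x perturbations).
lfc_train = rng.normal(0.0, 0.5, (n_genes, n_train_perts))

# Mean-prediction baseline: for any unseen perturbation, predict the
# average LFC profile across the training perturbations [2].
mean_pred = lfc_train.mean(axis=1)

# Quantify the gap: compare a model's prediction against this floor.
lfc_unseen = rng.normal(0.0, 0.5, n_genes)   # held-out ground truth
model_pred = rng.normal(0.0, 0.5, n_genes)   # stand-in for a model's output

baseline_err = np.linalg.norm(mean_pred - lfc_unseen)
model_err = np.linalg.norm(model_pred - lfc_unseen)
```

On this toy data the near-zero mean profile already beats the uninformed "model", illustrating why an uncalibrated predictor can lose to the mean baseline on real benchmarks.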
Problem: Model fails to accurately predict transcriptome changes for genetic perturbations or drug combinations not seen during training.
Investigation & Diagnosis
Solution: If outperformed by baselines, consider:
Problem: Model cannot correctly identify or classify genetic interactions (e.g., synergistic, buffering).
Investigation & Diagnosis
Solution
Table 1: Benchmarking Results for Double Perturbation Prediction (based on Norman et al. data) [2]
| Model / Baseline | Prediction Error (L2 Distance) | Performance in Genetic Interaction Prediction |
|---|---|---|
| Additive Baseline | Reference | Does not predict interactions by definition |
| No Change Baseline | Higher than Additive | Not better than random |
| scGPT | Higher than Additive | Not better than random |
| GEARS | Higher than Additive | Not better than random |
| Geneformer | Higher than Additive | Not better than random |
| scBERT | Higher than Additive | Not better than random |
Table 2: Performance of Models on Unseen Single Perturbations [2]
| Model / Approach | Performance on Adamson (K562) & Replogle (K562, RPE1) Data |
|---|---|
| Mean Prediction Baseline | Competitive, often not outperformed |
| Linear Model (P from training data) | Competitive |
| scGPT (with its own decoder) | Did not consistently outperform baseline |
| GEARS (with its own decoder) | Did not consistently outperform baseline |
| Linear Model (with scGPT's G, training P) | Outperformed scGPT's native decoder |
| Linear Model (P pretrained on Replogle data) | Consistently outperformed all other models |
Objective: Quantify if a complex model provides value over simple baselines [2].
Materials: Dataset with single and double perturbation phenotypes (e.g., log-transformed expression values).
Method:
- Compute the additive baseline prediction as LFC(A+B) = LFC(A) + LFC(B), where LFC is the logarithmic fold change versus control.

Objective: Test the predictive utility of pretrained embeddings using a simple, interpretable model [2].
Materials:
- Training matrix Y_train (genes x perturbations).
- Gene embedding matrix G (optional).
- Perturbation embedding matrix P (optional).
Method:
- If G or P are not provided, create them via dimension reduction (e.g., PCA) on the training data.
- Fit the linear model Y_train ≈ (G * W * P^T) + b, where:
  - G: Gene embedding matrix (number of genes x K dimensions).
  - P: Perturbation embedding matrix (number of perturbations x L dimensions).
  - W: The learned weight matrix (K x L).
  - b: The vector of row means from Y_train.
- Use an embedding p_new (from P or a lookup) to predict gene expression for a new perturbation: y_pred = (G * W * p_new^T) + b.
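The linear-model protocol above can be sketched end-to-end in numpy. The PCA-style embeddings (via truncated SVD) and the random Y_train are illustrative assumptions; the fit itself is ordinary least squares in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_perts, K, L = 50, 20, 5, 4

# Illustrative training matrix of log fold changes (genes x perturbations).
Y_train = rng.normal(size=(n_genes, n_perts))

# Step 1: derive G and P by dimension reduction when none are provided
# (truncated SVD of the row-centered matrix, i.e. PCA).
b = Y_train.mean(axis=1, keepdims=True)   # vector of row means
Yc = Y_train - b
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
G = U[:, :K]    # gene embeddings, n_genes x K
P = Vt[:L].T    # perturbation embeddings, n_perts x L

# Step 2: fit W in Y_train ≈ G @ W @ P.T + b by least squares.
W = np.linalg.pinv(G) @ Yc @ np.linalg.pinv(P.T)   # K x L

# Step 3: predict expression change for a perturbation embedding p_new.
p_new = P[0]                        # reuse a training embedding for illustration
y_pred = G @ W @ p_new + b.ravel()  # n_genes vector
```

Swapping in G or P extracted from an scFM (instead of the SVD factors) reproduces the embedding-transfer experiment described in [2].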
Experimental Design Flow
Zero Shot Prediction Pipeline
Evidence for ScFM Limitations
Table 3: Essential Computational Tools and Datasets
| Item Name | Function / Description | Relevance to Troubleshooting |
|---|---|---|
| Simple Linear Baseline Models | Provides a critical performance benchmark for any complex model. | Confirms if a complex model adds value; essential for diagnosing poor performance [2]. |
| Perturbation Datasets (e.g., Norman, Replogle) | Standardized, publicly available datasets for training and benchmarking. | Allows for reproducible benchmarking and comparison of model performance against published results [2]. |
| Gene & Perturbation Embeddings (G, P) | Low-dimensional representations of genes and perturbations. | Their predictive utility can be tested in a linear model framework to isolate embedding quality from architecture complexity [2]. |
| Sequential Model Optimization (SMO) Framework | An active learning approach that selects the most informative experiments to run next. | Efficiently explores large drug combination spaces, enriching for synergistic hits with minimal experimentation [16]. |
| Large Language Model (LLM) Embeddings | Context-enriched embeddings for drugs and cell lines generated from models like GPT-3.5. | Can be used as input features to represent drugs and cell lines in a unified pipeline for tasks like drug synergy prediction [17]. |
Q: My foundation model for predicting chemical perturbation effects performs poorly on novel compounds or cell lines it wasn't trained on. What could be wrong?
A: This is a known limitation in current single-cell foundation models (scFMs). Recent benchmarking studies show that even advanced models like scGPT and Geneformer often fail to outperform simple baselines in zero-shot settings—where models are used without any further training on new data [2] [1]. Performance issues are particularly pronounced when predicting effects for unseen single or double perturbations [2].
Diagnosis Steps:
Solution Steps:
Q: My model is unable to accurately predict non-additive genetic interactions (like synergy or buffering) from double perturbation data. How can I improve this?
A: Many deep learning models struggle to correctly identify true genetic interactions, often exhibiting a strong bias towards predicting "buffering" interactions and rarely correctly predicting synergistic effects [2].
Diagnosis Steps:
Solution Steps:
Q: The cell embeddings produced by my foundation model in a zero-shot setting fail to separate cell types effectively or remove batch effects. Why?
A: Zero-shot evaluation of foundation models like scGPT and Geneformer reveals that their cell embeddings often underperform compared to established methods for tasks like cell type clustering and batch correction. The primary structure in the embeddings may be driven by batch effects rather than biological signal [1].
Diagnosis Steps:
Solution Steps:
Table 1: Benchmarking Model Performance on Double Genetic Perturbation Prediction (based on Norman et al. data in [2])
| Model / Baseline | Prediction Error (L2 distance) vs. Additive Baseline | Strength in Predicting Genetic Interactions |
|---|---|---|
| Additive Baseline | Reference (Best) | None (by definition) |
| No Change Baseline | Higher | Poor (cannot predict synergy) |
| GEARS | Higher | Poor (rarely predicts correct synergy) |
| scGPT | Higher | Poor |
| Geneformer* | Higher | Poor |
| scBERT* | Higher | Poor |
Note: Models marked with * were not originally designed for the task and were repurposed with a linear decoder. L2 distance was calculated for the top 1,000 most highly expressed genes. None of the deep learning models outperformed the simple additive baseline [2].
Table 2: Zero-Shot Performance on Cell Type Clustering (AvgBIO Score) (representative data from [1])
| Model / Method | Pancreas Dataset | Immune Dataset | Tabula Sapiens | PBMC (12k) |
|---|---|---|---|---|
| HVG (Highly Variable Genes) | Best | Best | Best | High |
| scVI | High | Medium | High | Medium |
| Harmony | High | Medium | Medium | Medium |
| scGPT | Low | Low | Low | Best |
| Geneformer | Low | Low | Low | Low |
Table 3: Key Reagent Solutions for Perturbation Screening
| Reagent / Material | Function in Experiment |
|---|---|
| CRISPR Activation/Interference System | Used to perform targeted genetic perturbations (gene knockout or activation) in cell lines (e.g., K562, RPE1) to generate ground-truth data for model training and validation [2]. |
| SMILES Strings & RDKit | Simplified Molecular-Input Line-Entry System strings provide a standardized text representation of a compound's chemical structure. RDKit is a cheminformatics library used to process SMILES and generate molecular fingerprints (e.g., FCFP) for model input, enabling generalization to novel compounds [18]. |
| LanthaScreen Eu Kinase Binding Assay | A TR-FRET based binding assay. Useful for studying kinase interactions, including with inactive forms of the kinase, which may not be possible with activity assays [19]. |
| Functional-Class Fingerprints (FCFP) | A type of molecular fingerprint generated from a compound's SMILES string. It captures functional topology information, which can be rescaled by dosage to create an embedding vector representing the chemical perturbation for models like PRnet [18]. |
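The dosage-rescaled fingerprint idea in the table above can be illustrated without a cheminformatics stack. The trigram-hashing scheme below is a toy stand-in for real FCFP generation (which would use RDKit on the molecular graph); only the dosage-scaling step mirrors the PRnet-style encoding described in [18]:

```python
import hashlib

def toy_fingerprint(smiles, n_bits=64):
    # Hash character trigrams of a SMILES string into a fixed-length bit
    # vector. NOT a real FCFP; purely illustrative.
    bits = [0] * n_bits
    for i in range(len(smiles) - 2):
        h = hashlib.md5(smiles[i:i + 3].encode()).digest()
        bits[int.from_bytes(h[:4], "big") % n_bits] = 1
    return bits

def dosage_scaled_embedding(smiles, dose_um, max_dose_um=10.0):
    # Rescale the binary fingerprint by normalized dosage so the same
    # compound at different doses yields distinct perturbation vectors.
    scale = dose_um / max_dose_um
    return [b * scale for b in toy_fingerprint(smiles)]

emb = dosage_scaled_embedding("CC(=O)Oc1ccccc1C(=O)O", dose_um=2.5)  # aspirin
```
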
This protocol is adapted from the benchmarking study detailed in [2].
Data Preparation:
Establish Baselines:
- Compute log fold changes as log(expression_perturbed + 1) - log(expression_control + 1).

Model Training and Fine-tuning:
Evaluation:
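Both the baseline computation and the evaluation above rest on pseudocount log fold changes. A minimal sketch on hypothetical mean-expression values:

```python
import numpy as np

# Hypothetical mean expression (counts) for control and perturbed cells.
expr_control = np.array([100.0, 5.0, 0.0, 42.0])
expr_perturbed = np.array([50.0, 20.0, 3.0, 42.0])

# Log fold change with a pseudocount of 1, so zero-count genes are defined.
lfc = np.log(expr_perturbed + 1) - np.log(expr_control + 1)
```
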
This protocol is based on the zero-shot evaluation framework presented in [1].
Embedding Generation:
Baseline Methods:
Task Evaluation:
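Task evaluation in [1] relies on metrics such as the average silhouette width (ASW) over cell-type labels. A plain-numpy sketch of ASW on toy embeddings (scikit-learn's silhouette_score computes the same quantity; the two-cluster data here is an illustrative assumption):

```python
import numpy as np

def average_silhouette_width(X, labels):
    # Pairwise Euclidean distances between all embeddings.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    uniq = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():      # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = D[i, same].mean()   # mean intra-cluster distance
        b = min(D[i, labels == c].mean() for c in uniq if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated toy "cell type" clusters in an 8-d embedding space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(3, 0.1, (20, 8))])
labels = np.array([0] * 20 + [1] * 20)

asw = average_silhouette_width(X, labels)
asw_scaled = (asw + 1) / 2   # [0, 1]-scaled ASW, as reported in benchmarks
```

Computing this score on scFM embeddings and on an HVG+PCA baseline, with the same labels, reproduces the comparison underlying Table 2 of [1].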
FAQ 1: What are the most common failure modes for zero-shot perturbation prediction in single-cell foundation models (scFMs)?
Research indicates several key failure modes when using scFMs for zero-shot perturbation prediction:
FAQ 2: My model's perturbation predictions seem biologically implausible. What could be wrong?
This is a recognized limitation in current scFMs. Benchmarking studies have found that a significant challenge for these models is predicting "strong or atypical perturbation effects" [6]. Furthermore, models may exhibit biases, such as consistently predicting buffering interactions while rarely and inaccurately predicting synergistic ones [2]. This suggests the model may not have learned the underlying gene regulatory networks robustly enough for out-of-distribution predictions.
FAQ 3: Can I trust the gene embeddings from a foundation model for my perturbation analysis?
Caution is advised. External benchmarks that extracted gene embeddings from scFMs like scGPT and scFoundation found that using these embeddings in a simple linear model did not consistently outperform using embeddings derived directly from the perturbation training data [2]. This indicates that the pretraining on single-cell atlases may not yet provide a decisive advantage over task-specific training for perturbation prediction.
FAQ 4: Are there any strategies to improve the accuracy of my perturbation predictions?
Evidence suggests that moving from an "open-loop" to a "closed-loop" framework can significantly enhance performance. This involves fine-tuning the foundation model by incorporating a limited amount of experimental perturbation data (e.g., from Perturb-seq). Studies have shown that even a small number of perturbation examples (around 20) integrated during fine-tuning can dramatically improve metrics like positive predictive value, sensitivity, and specificity [10].
Problem: Model fails to predict genetic interactions in double-gene knockout experiments.
| Observation | Model output for a double perturbation is essentially the sum of the two single perturbations (additive effect), or shows no change from the control. The model fails to identify known synergistic or buffering interactions. |
|---|---|
| Root Cause | The model has not effectively learned the underlying, non-linear relationships between genes that lead to emergent effects in combinations. The pretraining objective may not adequately prepare the model for this specific task in a zero-shot setting [2]. |
| Solution | 1. Employ a Simple Baseline: Always compare your model's performance against a simple additive baseline model, which predicts the double perturbation effect as the sum of the two individual logarithmic fold changes. 2. Utilize Closed-Loop Fine-Tuning: If possible, move away from a pure zero-shot setting. Fine-tune the foundation model on any available double perturbation data, even from other cell types or conditions, to help it learn the concept of genetic interactions [10]. |
| Prevention | When selecting a model, consult recent independent benchmarks that explicitly test for genetic interaction prediction rather than relying solely on claims from model publications [2]. |
Problem: Predictions are inaccurate for perturbations on genes or in cell types not well-represented in the model's pretraining data.
| Observation | The model's predictions for unseen perturbations are no better than simply predicting the average expression across the training set. Performance degrades significantly under distribution shift [6] [2]. |
|---|---|
| Root Cause | The foundation model's knowledge is constrained by the scope and diversity of its pretraining corpus. It lacks the ability to generalize reliably to completely novel biological contexts without additional guidance. |
| Solution | 1. Use a Perturbation-Informed Baseline: Implement a linear model that leverages embeddings pretrained on large-scale perturbation datasets (if available), which may generalize better than atlas-based pretraining [2]. 2. Leverage Similarity Metrics: For diseases with no known treatments, use models that explicitly transfer knowledge from similar, well-annotated diseases via metric learning, rather than relying on a pure zero-shot approach [20]. |
| Prevention | Critically evaluate the pretraining data composition of any scFM before applying it to your specific research problem to identify potential data gaps. |
Problem: Cell embeddings from the foundation model do not integrate well across batches in a zero-shot setting.
| Observation | When visualizing the embeddings (e.g., via UMAP), cells cluster strongly by batch or experiment of origin rather than by cell type or biological state. |
|---|---|
| Root Cause | The model's pretraining objective (e.g., masked language modeling) does not automatically learn to produce batch-invariant representations. In zero-shot use, it has not been explicitly trained to remove technical noise [1]. |
| Solution | 1. Use Established Batch Correction Tools: Pass the scFM embeddings into dedicated batch integration algorithms like Harmony or scVI as a post-processing step. 2. Try a Simpler Approach: Benchmark the scFM's performance against a simple baseline of Highly Variable Genes (HVG), which has been shown to outperform foundation models in some batch integration tasks [1]. |
| Prevention | Do not assume that a foundation model's embeddings are batch-corrected by default. Always include batch integration as a formal step in your analysis workflow. |
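To make the batch-effect failure mode concrete, the sketch below builds toy embeddings where a technical offset dominates, then applies per-batch mean centering. This centering is a deliberately crude stand-in for dedicated tools like Harmony or scVI (which model batch structure explicitly), used here only to show what correction does to the embedding geometry:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_dims = 200, 16

# Toy scFM embeddings: biological signal plus a large per-batch offset.
batch = rng.integers(0, 2, n_cells)
bio = rng.normal(0.0, 0.5, (n_cells, n_dims))
offsets = np.array([np.zeros(n_dims), np.full(n_dims, 4.0)])
emb = bio + offsets[batch]

# Crude correction: subtract each batch's centroid.
corrected = emb.copy()
for b in np.unique(batch):
    corrected[batch == b] -= corrected[batch == b].mean(axis=0)

# Distance between batch centroids before vs. after correction.
gap_before = np.linalg.norm(emb[batch == 0].mean(0) - emb[batch == 1].mean(0))
gap_after = np.linalg.norm(corrected[batch == 0].mean(0)
                           - corrected[batch == 1].mean(0))
```

In real workflows the same diagnostic (centroid distance, or batch ASW) applied to raw scFM embeddings reveals whether batch, rather than biology, drives the primary structure.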
Table 1: Benchmarking scFMs against Baselines in Double Perturbation Prediction (L2 Distance, lower is better) [2]
| Model / Baseline | Average L2 Distance (Top 1000 Genes) |
|---|---|
| Additive Model (Baseline) | ~1.5 |
| No Change Model (Baseline) | ~2.5 |
| scGPT | ~2.5 |
| Geneformer* | ~3.0 |
| GEARS | ~2.3 |
| scFoundation | ~2.1 |
Note: Models marked with * were repurposed with a linear decoder.
Table 2: Zero-Shot Batch Integration Performance (Average BIO Score, higher is better) [1]
| Method | Pancreas Dataset | PBMC Dataset | Tabula Sapiens Dataset |
|---|---|---|---|
| HVG (Baseline) | ~0.7 | ~0.75 | ~0.8 |
| scVI | ~0.68 | ~0.72 | ~0.78 |
| Harmony | ~0.65 | ~0.7 | ~0.75 |
| scGPT | ~0.55 | ~0.72 | ~0.65 |
| Geneformer | ~0.45 | ~0.5 | ~0.55 |
Table 3: Impact of Closed-Loop Fine-Tuning on Prediction Accuracy [10]
| Metric | Open-Loop ISP | Closed-Loop ISP (with ~20 examples) |
|---|---|---|
| Positive Predictive Value (PPV) | 3% | 9% |
| Sensitivity | 48% | 76% |
| Specificity | 60% | 81% |
| Negative Predictive Value (NPV) | 98% | 99% |
Protocol 1: Benchmarking Perturbation Effect Prediction
This protocol is based on the benchmark described in [2].
Protocol 2: Evaluating Zero-Shot Cell Embeddings for Batch Integration
This protocol is based on the evaluation performed in [1].
Zero-Shot Failure and Solution Workflow
Genetic Interaction Prediction Failure
Table 4: Essential Research Reagents and Computational Tools
| Item | Function in Evaluation |
|---|---|
| CRISPRa/i Perturbation Datasets (e.g., Norman et al.) | Provides ground truth data for benchmarking model predictions of single and double gene perturbation effects on transcriptomes [2]. |
| Benchmark Datasets with Batch Effects (e.g., Pancreas Dataset) | Allows for the evaluation of a model's zero-shot ability to integrate data from multiple sources and correct for technical variation [1]. |
| Perturb-seq Data | Single-cell RNA-sequencing data from genetic perturbation screens. Used for closed-loop fine-tuning of foundation models to dramatically improve prediction accuracy [10]. |
| High-Performance Computing (HPC) Cluster | Essential for running and fine-tuning large foundation models, which are computationally intensive and often require GPU acceleration [2] [10]. |
| Linear Regression Model | A deliberately simple baseline model. Critical for determining if a complex foundation model provides any meaningful performance improvement for a given task [2]. |
| Batch Correction Tools (e.g., Harmony, scVI) | Established algorithms used to correct for technical batch effects. Can be applied as a post-processing step to scFM embeddings or used as a performance baseline [1]. |
Single-cell foundation models (scFMs) like scGPT and Geneformer represent a significant advance in computational biology, promising to leverage large-scale pretraining to understand cellular states and predict experimental outcomes. A particularly ambitious goal for these models is zero-shot perturbation prediction—forecasting a cell's transcriptional response to genetic or chemical perturbation without any task-specific training data.
However, recent rigorous benchmarking studies reveal a sobering reality: in zero-shot settings, these complex models often fail to outperform simpler, traditional methods. This technical support article frames these limitations within the critical context of data quality and curation, providing researchers with troubleshooting guidance to navigate these challenges in their experimental workflows.
Q: My zero-shot scFM embeddings are underperforming for cell type annotation. What might be wrong? A: This is a recognized systematic limitation. Benchmarking studies indicate that scFM embeddings in zero-shot settings frequently underperform established dimensionality reduction techniques. Before assuming an implementation error, compare your results against a baseline method.
Q: Why do my scFM's perturbation effect predictions seem inaccurate? A: Predicting transcriptional responses to perturbation is a fundamentally challenging task. Recent evidence suggests that the foundational pretraining of many scFMs may not be optimally transferring to this specific objective.
Q: Can I trust a high benchmark score reported in an scFM's original publication? A: Exercise caution. Some original model publications may have used benchmark settings or comparisons that were particularly favorable. Independent, post-publication benchmarking is crucial for a realistic performance assessment.
Q: My model shows inflated performance during training but fails on external validation. What is the cause? A: This is a classic symptom of inadequate data curation. A common culprit is the presence of duplicate or non-independent data points in your training set, which leads to overfitting and poor generalizability.
Q: How does data quality from experimental sources impact my model's reliability? A: Profoundly. The reproducibility of the underlying experimental data used for training and benchmarking is a primary constraint on your model's achievable accuracy.
Objective: To quantitatively evaluate the quality of scFM-generated cell embeddings for downstream tasks like cell type clustering and batch integration, comparing them against established baseline methods.
Materials:
Methodology:
Expected Outcome: Simpler methods like HVG selection, scVI, and Harmony will often outperform or match scFMs in zero-shot settings. The following table summarizes typical benchmark results:
Table 1: Sample Benchmark Results for Cell Embedding Quality (Based on [1])
| Evaluation Metric | scGPT | Geneformer | HVG+PCA | scVI | Harmony |
|---|---|---|---|---|---|
| AvgBIO Score (Cell Type) | Variable, often lower | Underperforms | Consistently high | High | High |
| Batch Mixing Score | Moderate on seen data | Consistently low | High | High on technical batches | High on technical batches |
| Key Weakness | Inconsistent across datasets | Poor preservation of cell type info | N/A | Struggles with complex biological batches | Lower PCR score on complex datasets |
Objective: To assess an scFM's ability to predict gene expression changes after single or double genetic perturbations, comparing its accuracy against simple additive and "no change" baselines.
Materials:
Methodology:
- For a double perturbation A+B, predict the sum of the LFCs from the single perturbations A and B.

Expected Outcome: The simple additive baseline is often very difficult to beat. Most scFMs will exhibit a higher prediction error (L2 distance) than this baseline [2].
Table 2: Performance Overview in Perturbation Prediction Benchmarks (Based on [2])
| Model Type | Performance vs. Additive Baseline | Performance on Unseen Perturbations | Identification of Genetic Interactions |
|---|---|---|---|
| scFMs (scGPT, scFoundation) | Higher prediction error | Not consistently better than linear models | Struggles; mostly predicts buffering types |
| Other DL Models (GEARS, CPA) | Higher prediction error | Not designed for unseen perturbations | Varies, but often suboptimal |
| Simple Additive Model | Baseline | Not applicable | Cannot predict by definition |
| Simple "No Change" Model | Higher prediction error | N/A | Cannot predict synergistic interactions |
| Linear Model with Pre-trained Embeddings | N/A | Can outperform full scFMs | N/A |
Table 3: Key Resources for scFM Research and Benchmarking
| Resource Name | Type | Function/Benefit | Example/Note |
|---|---|---|---|
| PertEval-scFM | Benchmarking Framework | Standardized framework for evaluating perturbation effect prediction in zero-shot settings [6]. | Helps avoid inflated performance claims. |
| Norman et al. Dataset | Benchmark Data | Provides ground-truth expression for 100 single and 124 double gene perturbations in K562 cells [2]. | Essential for reproducibility in perturbation tasks. |
| Replogle et al. Datasets | Benchmark Data | CRISPRi datasets in K562 and RPE1 cells for evaluating prediction on unseen perturbations [2]. | Tests model generalizability across cell lines. |
| Linear Baseline Model | Computational Method | A simple, interpretable model using gene and perturbation embeddings; serves as a critical sanity check [2]. | Often matches or beats complex scFMs. |
| TxGNN | Foundation Model (Drug Repurposing) | A graph-based model for zero-shot drug repurposing; demonstrates successful zero-shot application in a related domain [20]. | Provides a design pattern for effective zero-shot learning. |
| Rigorous Data Curation Pipeline | Methodology | A protocol for deduplication, unit harmonization, and error checking of input data [21] [22]. | The most critical factor for building reliable models. |
The following diagram illustrates the critical path for developing and evaluating reliable single-cell foundation models, highlighting how rigorous data curation and benchmarking are foundational to success.
Diagram 1: scFM Development Workflow
The limitations observed in zero-shot perturbation prediction by current single-cell foundation models are not merely algorithmic but are fundamentally tied to the quality, reproducibility, and curation of the data on which they are built and evaluated. Researchers can navigate this landscape more effectively by:
By adopting these practices, the field can build a more reliable foundation, steering the development of scFMs from models that underperform simple methods to robust tools that genuinely advance biological discovery.
This resource addresses common challenges researchers face when designing and evaluating models for biological reasoning, particularly in the context of zero-shot prediction of genetic perturbation effects.
FAQ 1: Why do our sophisticated foundation models underperform simple baselines in zero-shot perturbation prediction?
Answer: Current single-cell foundation models (scFMs) often underperform simpler methods like Highly Variable Genes (HVG) selection or linear models in zero-shot settings due to several potential architectural and training limitations [1] [2] [23].
Troubleshooting Steps:
FAQ 2: How can we improve a model's generalization to unseen cell types or strong perturbations?
Answer: Generalization fails when models learn dataset-specific artifacts rather than underlying biological principles. This is evident in the significant performance drop observed under distribution shift [23].
FAQ 3: Our model fails to predict genetic interactions. What could be the cause?
Answer: Predicting non-additive genetic interactions (e.g., synergistic or buffering effects) is a complex challenge. Many models default to predicting buffering interactions or struggle to deviate from additive expectations [2].
The tables below summarize key findings from recent benchmarks, highlighting the performance gap between proposed scFMs and simpler models.
Table 1: Zero-Shot Cell Embedding Performance on Clustering (AvgBIO Score) [1]
| Model / Method | Pancreas Dataset | PBMC (12k) Dataset | Tabula Sapiens Dataset |
|---|---|---|---|
| HVG (Baseline) | 0.671 | 0.631 | 0.617 |
| scVI | 0.645 | 0.619 | 0.605 |
| Harmony | 0.632 | 0.600 | 0.589 |
| scGPT | 0.580 | 0.640 | 0.590 |
| Geneformer | 0.512 | 0.528 | 0.523 |
Table 2: Perturbation Effect Prediction Performance (L2 Distance, lower is better) [2]
| Model / Method | Double Perturbation Prediction | Unseen Perturbation Prediction |
|---|---|---|
| Additive Model (Baseline) | ~1.4 | Not Applicable |
| No Change Model (Baseline) | ~1.7 | ~1.5 |
| scGPT | ~1.7 | ~1.6 |
| Geneformer* | ~1.8 | ~1.7 |
| GEARS | ~1.6 | ~1.6 |
*Note: Models marked with * were repurposed for this task with a linear decoder.
Protocol 1: Zero-Shot Cell Embedding Evaluation
This protocol evaluates the quality of cell embeddings for downstream tasks without any fine-tuning [1].
Protocol 2: Zero-Shot Perturbation Effect Prediction
This protocol tests a model's ability to predict transcriptome-wide changes after a genetic perturbation without being explicitly trained on that perturbation [2] [23].
HRM Brain-Inspired Architecture
Zero-Shot Perturbation Prediction
Table 3: Essential Computational Tools & Datasets for scFM Evaluation
| Item Name | Function / Purpose | Example Use Case |
|---|---|---|
| PertEval-scFM [23] | A standardized benchmarking framework for evaluating zero-shot perturbation prediction. | Systematically comparing embedding quality from different scFMs against simple baselines. |
| Norman et al. Dataset [2] | A canonical dataset with single and double gene perturbations in K562 cells. | Benchmarking model performance on predicting double perturbation effects and genetic interactions. |
| Replogle et al. Dataset [2] | A large-scale single-cell CRISPRi perturbation dataset across multiple cell lines (K562, RPE1). | Evaluating model generalization to unseen perturbations and cross-cell-line prediction. |
| Linear Decoder / Adapter [2] [24] | A simple, trainable layer attached to a frozen foundation model. | Enabling efficient task-specific fine-tuning while preserving pretrained knowledge for zero-shot evaluation. |
| Additive & No-Change Baselines [2] | Deliberately simple prediction models that serve as a critical sanity check. | Establishing a performance floor; any proposed complex model should outperform these. |
Q: What is a single-cell Foundation Model (scFM), and how is it supposed to work? A: A single-cell Foundation Model (scFM) is a large-scale deep learning model, typically based on a transformer architecture, that is pretrained on vast datasets containing millions of single-cell transcriptomes [26]. The concept is inspired by large language models. In these models, individual cells are treated like "sentences," and genes or genomic features are treated as "words" or "tokens" [26]. Through self-supervised pretraining (like predicting masked genes), the model aims to learn a universal representation of cellular states that can be adapted to various downstream tasks—such as predicting how a cell's gene expression will change after a genetic perturbation—without the need for extensive new experimental data [26].
Q: My goal is to predict the effects of unseen single or double genetic perturbations. Which model should I use? A: Current evidence suggests that you may achieve better or comparable results with deliberately simple baseline models rather than complex deep-learning scFMs. A 2025 benchmark study found that none of the five evaluated foundation models (including scGPT and scFoundation) and two other deep learning models outperformed simple additive or linear baselines for predicting transcriptome changes after single or double perturbations [2]. For predicting the effect of a perturbation not seen during training, a simple linear model or even just predicting the mean expression from the training data can outperform or match sophisticated foundation models [2].
Q: I have heard scFMs perform well in zero-shot settings. Is this true for tasks like batch integration or cell type identification? A: Rigorous zero-shot evaluation (using the model without any fine-tuning) reveals significant limitations. A 2025 study found that for cell type clustering and batch integration, the zero-shot performance of scGPT and Geneformer was inconsistent and often worse than established, simpler methods [1]. For instance, a simple approach of selecting Highly Variable Genes (HVG) frequently outperformed these foundation models in batch integration tasks. In cell type clustering, methods like scVI and Harmony generally provided more robust embeddings than the zero-shot scFM embeddings [1].
Q: What are the most common failure modes when using scFMs for perturbation prediction? A: Benchmarks have identified several specific failure modes [2]:
- Strong or atypical perturbation effects are predicted poorly, with predicted profiles varying less than the ground truth.
- Genetic interaction predictions are biased toward "buffering" effects, while synergistic and opposite interactions are frequently missed.
- Performance degrades under distribution shift, for example when generalizing to unseen perturbations or new cell lines [6].
Q: If simple models are currently better, what is the value of a foundation model? A: The primary value proposed for scFMs is their potential as a unified framework for multiple biological tasks. While they may not yet excel at specific tasks like perturbation prediction, their broad pretraining is intended to build a general understanding of biology [26]. The goal is a single model that can be adapted to many problems. However, current research indicates that this promise has not yet been fully realized for perturbation modeling, and simpler, task-specific models are more reliable for this application [2] [1].
Problem: Poor performance in predicting double perturbation effects.
Problem: Model fails to generalize to unseen perturbations.
Problem: Inconsistent zero-shot performance on tasks like cell type annotation or batch correction.
Table 1: Benchmarking scFMs against baselines for double perturbation prediction. Data adapted from a 2025 study comparing model performance on predicting transcriptome changes after double gene perturbations. Prediction error is measured as L2 distance; lower is better [2].
| Model / Baseline | Prediction Error (L2 Distance) | Outperforms Additive Baseline? |
|---|---|---|
| Additive Model (Simple Baseline) | ~1.5 (Reference) | N/A |
| scGPT | ~2.8 | No |
| GEARS | ~2.5 | No |
| scFoundation | ~2.4 | No |
| Geneformer* | ~3.2 | No |
| No Change Model | ~3.5 | No |
Note: Models marked with an asterisk were not originally designed for the task and were repurposed with a linear decoder [2].
Table 2: Performance of a linear model with various embeddings for unseen perturbation prediction. Data shows that a simple linear model can be highly effective. The "Perturbation Data" embedding refers to pretraining on other perturbation datasets (e.g., using K562 data to predict RPE1 effects) [2].
| Embedding Source for Linear Model | Performance vs. Mean Baseline | Performance vs. scGPT/GEARS |
|---|---|---|
| Perturbation Data (e.g., Replogle) | Consistently outperforms | Outperforms |
| scFoundation Gene Embedding | Outperforms | Comparable or better |
| scGPT Gene Embedding | Outperforms | Comparable or better |
| Training Data Only | Comparable | Comparable or better |
Table 3: Zero-shot performance on cell type clustering (Average BIO Score). Higher scores are better. Data shows that simple methods often outperform foundation models. Adapted from a 2025 zero-shot evaluation [1].
| Model / Method | PBMC (12k) Dataset | Tabula Sapiens Dataset | Pancreas Dataset |
|---|---|---|---|
| HVG (Highly Variable Genes) | ~0.75 | ~0.63 | ~0.58 |
| scVI | ~0.72 | ~0.61 | ~0.56 |
| Harmony | ~0.70 | ~0.55 | ~0.54 |
| scGPT (Zero-shot) | ~0.76 | ~0.57 | ~0.52 |
| Geneformer (Zero-shot) | ~0.65 | ~0.51 | ~0.48 |
Protocol 1: Benchmarking Perturbation Effect Prediction Against an Additive Baseline
This protocol is crucial for critically evaluating any model's performance on predicting double perturbation effects [2].
LFC_A+B (predicted) = LFC_A + LFC_B
Protocol 2: Zero-Shot Evaluation for Cell Type Clustering
This protocol assesses the intrinsic quality of a model's cell embeddings without fine-tuning [1].
The following diagram illustrates a recommended workflow for selecting and evaluating a model for perturbation prediction, incorporating key questions and steps based on recent benchmark findings.
Table 4: Key Research Reagents and Datasets for scFM Benchmarking
| Item | Function in Evaluation | Source / Example |
|---|---|---|
| Norman et al. Dataset | Provides ground-truth data for single and double gene perturbations (CRISPRa) in K562 cells. Used for benchmarking double perturbation prediction [2]. | Norman et al. 2019, as processed by subsequent studies [2]. |
| Replogle et al. Dataset | A large-scale CRISPRi dataset in K562 and RPE1 cells. Used for benchmarking the prediction of unseen single perturbations [2]. | Replogle et al. 2022 [2]. |
| Tabula Sapiens Dataset | A large, multi-tissue, single-cell reference atlas. Used as a benchmark for evaluating zero-shot performance on cell type clustering and batch integration [1]. | Tabula Sapiens Consortium [1]. |
| Simple Additive Model | A critical baseline model that sums the effects of single perturbations to predict a double perturbation. Used to validate if more complex models provide any advantage [2]. | Implemented from scratch as per benchmarking studies [2]. |
| Linear Model with Embeddings | A simple yet powerful model for predicting unseen perturbations. It uses gene and perturbation embedding matrices learned from data [2]. | Kernfeld et al. / Csendes et al. (preprints) [2]. |
Q1: What is the core finding of recent benchmarks on single-cell Foundation Models (scFMs) for perturbation prediction? Recent rigorous benchmarks consistently show that large, pretrained scFMs often fail to outperform deliberately simple baseline models when predicting gene expression changes after genetic perturbations in a zero-shot setting. Notably, simple baselines like an additive model (summing individual logarithmic fold changes) or even just predicting the mean expression from the training data can match or exceed the performance of complex foundation models like scGPT, Geneformer, and scFoundation [2] [1] [6].
Q2: Why is zero-shot evaluation particularly important for scFMs? Zero-shot evaluation tests a model's ability to perform a task without any additional task-specific training. This is critical for:
- Exploratory analyses, where predefined labels for fine-tuning are often unavailable [1].
- Judging whether pretraining has endowed the model with a genuine, transferable understanding of biology [1].
- In-silico prediction of experiments that have not yet been conducted, which is a central promise of foundation models.
Q3: What are some common simple baselines used in these benchmarks? Benchmarks often compare scFMs against the following baselines:
- No change model: always predicts the same expression as the control condition.
- Additive model: for a double gene perturbation, predicts the sum of the individual logarithmic fold changes.
- Mean prediction: always predicts the overall average gene expression from the training data.

Q4: Do scFMs provide useful data representations (embeddings) for perturbation tasks? While scFMs like scGPT and scFoundation learn gene embeddings during pretraining, benchmarks found that using these embeddings in a simple linear model did not consistently outperform linear models using embeddings derived directly from the perturbation data itself. This suggests that pretraining on large single-cell atlases may offer only a small benefit for this specific task compared to training on relevant perturbation data [2].
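Two of the simple baselines described in Q3, the no-change model and the mean predictor, reduce to one line each. The expression matrix below is a hypothetical toy example:

```python
import numpy as np

def no_change_baseline(control_expr):
    """Predicts the control-condition expression for any perturbation."""
    return control_expr

def mean_baseline(train_expr):
    """Predicts the per-gene mean expression across all training perturbations."""
    return train_expr.mean(axis=0)

# Toy data: 4 training perturbations x 3 genes (hypothetical values)
train_expr = np.array([
    [1.0, 2.0, 0.0],
    [1.2, 1.8, 0.2],
    [0.8, 2.2, 0.1],
    [1.0, 2.0, 0.1],
])
control = np.array([1.0, 2.0, 0.0])

print(mean_baseline(train_expr))
```

Despite their triviality, these predictors establish the performance floor that any proposed deep model must clear [2].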
Q5: What is PertEval-scFM? PertEval-scFM is a standardized benchmarking framework designed specifically to evaluate the ability of models, particularly single-cell Foundation Models, to predict transcriptional responses to genetic perturbations. It provides a rigorous and systematic way to test models in a zero-shot setting, highlighting their current limitations and guiding future development [6] [23] [5].
Problem: Your scFM is not accurately predicting gene expression changes after a genetic perturbation and is underperforming compared to simple baselines.
Solutions:
- Benchmark against deliberately simple baselines, such as an additive model for double perturbations or a mean predictor. This establishes a performance floor and helps quantify the actual value added by the complex model [2].
Problem: The model performs well on data similar to its training set but fails to generalize to new cell types, tissues, or experimental conditions without fine-tuning.
Solutions:
- Check whether your evaluation dataset overlapped with the model's pretraining corpus; strong performance on "seen" data combined with poor performance on "unseen" data indicates limited generalizability [1].
- Evaluate the model explicitly under distribution shift using a standardized framework such as PertEval-scFM [6] [23].
- Consider efficient task-specific adaptation, for example attaching a trainable linear decoder or adapter to the frozen model [2] [24].
The table below summarizes the key experimental methodology from foundational benchmarking studies.
Table 1: Summary of Key Benchmarking Experiments
| Benchmark Study | Core Task | Models Evaluated | Simple Baselines | Key Evaluation Metric |
|---|---|---|---|---|
| Nature Methods (2025) [2] | Predict transcriptome changes after single/double perturbations. | scGPT, scFoundation, Geneformer, GEARS, CPA, UCE, scBERT. | 'No change', 'Additive', Linear model, 'Mean' prediction. | L2 distance between predicted & observed expression. |
| Genome Biology (2025) [1] | Zero-shot cell type clustering & batch integration. | Geneformer, scGPT. | Highly Variable Genes (HVG), Harmony, scVI. | Average BIO score, Average Silhouette Width. |
| PertEval-scFM (2024) [6] [23] [5] | Zero-shot perturbation effect prediction. | Five leading scFMs. | Simple baseline models. | Prediction accuracy on strong/atypical perturbations and under distribution shift. |
Table 2: Essential Research Reagents & Computational Tools
| Item Name | Function / Explanation | Relevance to Benchmarking |
|---|---|---|
| PertEval-scFM Framework | A standardized benchmarking framework for evaluating perturbation prediction. | Provides a rigorous protocol and metrics to fairly compare different models against baselines [6] [23]. |
| Additive Model | A simple baseline that sums the LFCs of single perturbations to predict a double perturbation's effect. | Serves as a critical baseline; outperforming it is a minimum requirement for any proposed complex model [2]. |
| Linear Model with Embeddings | A simple predictive model that uses gene/perturbation embeddings as input features. | Used to test whether an scFM's learned representations contain meaningful information for the task [2]. |
| CRISPR Perturbation Datasets | High-quality datasets from studies like Norman et al., Replogle et al., and Adamson et al.. | Provide the essential ground-truth data for training and benchmarking perturbation prediction models [2]. |
| Highly Variable Genes (HVG) | A standard feature selection method to filter genes before analysis. | A strong and simple baseline for tasks like cell type clustering and batch correction, often outperforming zero-shot scFMs [1]. |
The following diagram illustrates the logical workflow for benchmarking a single-cell Foundation Model against simple baselines, as described in the referenced studies.
Q1: When should I consider using a zero-shot single-cell foundation model (scFM) over a simpler, traditional method? The decision should be based on a careful evaluation of your specific task and resources. While scFMs are versatile and can be applied to diverse tasks without retraining, recent benchmarks indicate that in many cases, especially for perturbation effect prediction and cell type clustering, simpler methods can match or even surpass their performance [1] [2] [6]. You should prioritize scFMs when you need a single model for multiple exploratory tasks and have the computational resources to run them. For a single, well-defined task with a labeled dataset, a simpler model like a linear baseline or scVI may be more efficient and effective [28].
Q2: My zero-shot scGPT embeddings performed poorly on cell type clustering. What could be the reason? This is a commonly reported issue. The core problem may lie in the pretraining objective itself. Models like scGPT and Geneformer are often trained using a masked language modeling task, where they learn to predict the expression of randomly masked genes. However, evaluations suggest that these models may not have developed a deep, generalizable understanding of gene relationships from this task [1] [29]. They can struggle to predict held-out gene expression accurately, often defaulting to predicting median expression values, which indicates a failure to learn context-specific biological patterns that are crucial for distinguishing cell types zero-shot [29].
Q3: How can I rigorously benchmark my scFM for perturbation prediction to ensure the results are meaningful? A robust benchmark must include deliberately simple baselines. For predicting transcriptome changes after genetic perturbation, a "no change" model (predicting control condition expression) and an "additive" model (summing the logarithmic fold changes of single perturbations for a double perturbation) are essential comparators [2]. Surprisingly, multiple studies have found that current scFMs and other deep learning models often fail to outperform these simple baselines [2] [6]. It is also critical to evaluate the model's ability to predict genetic interactions and its performance on unseen perturbations, using established datasets from studies like Norman et al. and Replogle et al. [2].
Q4: What are the key metrics for evaluating the biological relevance of a scFM's embeddings? Beyond standard clustering metrics, it is important to use metrics that directly assess biological plausibility. Novel metrics like scGraph-OntoRWR measure the consistency between the cell-type relationships captured by the model's embeddings and the known relationships in established cell ontologies [28]. Additionally, for cell type annotation tasks, the Lowest Common Ancestor Distance (LCAD) metric can be used; it measures the ontological proximity of misclassified cell types, ensuring that any errors are biologically reasonable (e.g., confusing two closely related T-cell subtypes) rather than nonsensical [28].
Problem: Poor Zero-Shot Performance on Cell Type Clustering and Batch Integration Issue: Your scFM embeddings fail to separate known cell types or correct for technical batch effects better than established methods like Harmony or scVI.
| Step | Action | Expected Outcome & Further Diagnosis |
|---|---|---|
| 1. Baseline Comparison | Compare your scFM results against a simple Highly Variable Genes (HVG) baseline and integration methods like Harmony or scVI [1]. | If HVG outperforms your scFM, it indicates a fundamental issue with the foundational embeddings [1]. |
| 2. Check Data Overlap | Investigate if your evaluation dataset was part of the model's pretraining corpus (e.g., check original model papers) [1]. | Strong performance on "seen" data but poor performance on "unseen" data suggests overfitting and a lack of generalizability [1]. |
| 3. Visual Inspection | Create UMAP plots colored by both cell type and batch [1]. | A good embedding will show clear clustering by cell type and mixing of batches. If the primary structure is driven by batch, the model has failed at integration [1]. |
| Solution: If the above steps confirm the model's limitations, consider switching to a simpler, more robust method for your specific task. The computational cost of scFMs may not be justified for zero-shot clustering and integration based on current evidence [1] [28]. |
Problem: Inaccurate Prediction of Genetic Perturbation Effects Issue: Your model cannot accurately predict gene expression changes following single or double genetic perturbations.
| Step | Action | Expected Outcome & Further Diagnosis |
|---|---|---|
| 1. Implement Simple Baselines | Benchmark your model against a "no change" model and an "additive" model for double perturbations [2]. | Failure to outperform these baselines is a major red flag that the model has not learned the underlying biological causality [2]. |
| 2. Analyze Failure Modes | Examine which types of perturbations are poorly predicted. | Models often struggle with predicting strong or atypical perturbation effects and are biased towards predicting "buffering" interactions over "synergistic" ones [2] [6]. |
| 3. Test Embedding Utility | Extract the model's gene embeddings and use them in a simple linear predictor [2]. | If the linear model with these embeddings performs as well as the full model, it suggests the model's complex decoder is not adding value [2]. |
| Solution: Given that current scFMs struggle with this task, a pragmatic approach is to use a simple linear model, potentially augmented with pretrained gene embeddings from a foundation model or, more effectively, from prior perturbation data [2]. |
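Step 3 above (testing embedding utility with a linear probe) can be sketched as follows. The embeddings here are synthetic stand-ins for representations extracted from a frozen scFM, and the least-squares fit is one simple choice of linear model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, d, n_genes = 40, 5, 8, 20

# Hypothetical stand-ins for perturbation embeddings from a frozen model
E_train = rng.normal(size=(n_train, d))
E_test = rng.normal(size=(n_test, d))
W_true = rng.normal(size=(d, n_genes))            # synthetic ground-truth map
Y_train = E_train @ W_true + 0.1 * rng.normal(size=(n_train, n_genes))
Y_test = E_test @ W_true                          # held-out perturbation effects

# Linear probe: least-squares map from embedding space to expression change
W, *_ = np.linalg.lstsq(E_train, Y_train, rcond=None)
err_linear = np.linalg.norm(E_test @ W - Y_test, axis=1).mean()

# Mean-prediction baseline: the same answer for every unseen perturbation
mean_pred = Y_train.mean(axis=0)
err_mean = np.linalg.norm(mean_pred - Y_test, axis=1).mean()

print(err_linear < err_mean)
```

If such a probe on the frozen embeddings matches the full model's accuracy, the model's complex decoder is adding little beyond what the representations already encode [2].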
The following tables summarize key findings from recent, rigorous evaluations of single-cell foundation models.
Table 1: Zero-Shot Performance on Core Tasks vs. Baselines This table synthesizes findings from multiple benchmark studies comparing scFMs to established methods. [1] [28] [2]
| Task | Top-Performing Methods | Underperforming Methods | Key Metric | Performance Summary |
|---|---|---|---|---|
| Cell Type Clustering | HVG, scVI, Harmony | scGPT, Geneformer | AvgBIO / ASW | scFMs were consistently outperformed by simpler methods across multiple datasets. In some cases, they performed worse than a randomly initialized model [1]. |
| Batch Integration | HVG, scVI, Harmony | Geneformer, scGPT | Batch Integration Score, PCR | HVG often achieved the best scores. Geneformer consistently ranked last, sometimes increasing batch effect variance [1]. |
| Perturbation Effect Prediction | Additive Model, No-Change Model, Linear Models | scGPT, Geneformer, GEARS, scFoundation | L2 Distance (Predicted vs. True Expr.) | Deep learning models, including scFMs, failed to consistently outperform simple baselines on predicting double perturbation outcomes or unseen perturbations [2] [6]. |
Table 2: scFM Performance on Perturbation Prediction Benchmark Data adapted from a study benchmarking models on the Norman et al. double perturbation dataset. [2]
| Model / Baseline | Prediction Error (L2 Distance, mean ± SE) | Outperforms Additive Baseline? | Notes |
|---|---|---|---|
| Additive Baseline | Reference Value | N/A | Sums single-gene LFCs; does not use double-perturbation data for training. |
| No-Change Baseline | ~1.5x Additive Error | No | Predicts control condition expression. |
| scGPT | ~1.4x Additive Error | No | Struggled to predict strong interactions; predictions varied less than ground truth [2]. |
| Geneformer* | ~1.6x Additive Error | No | Repurposed with a linear decoder; performance was suboptimal [2]. |
| scFoundation | ~1.3x Additive Error | No | Predictions showed limited variation across different perturbations [2]. |
| GEARS | ~1.3x Additive Error | No | Specifically designed for perturbation prediction but was outperformed by simple baselines [2]. |
*Note: Geneformer was not originally designed for this task and was adapted for the benchmark. [2]
Protocol 1: Benchmarking Zero-Shot Embeddings for Clustering This protocol evaluates the intrinsic quality of scFM cell embeddings for discerning cell types without any fine-tuning [1] [28].
Protocol 2: Evaluating Perturbation Effect Prediction This protocol tests a model's ability to predict transcriptional changes after genetic perturbation, a key claimed ability of some scFMs [2] [6].
Zero-Shot scFM Evaluation Workflow
Table 3: Essential Computational Tools and Datasets for scFM Evaluation
| Item | Function / Description | Example Use in Benchmarking |
|---|---|---|
| Benchmarking Datasets | High-quality, publicly available scRNA-seq data with ground truth labels. | Pancreas Dataset [1]: Tests batch integration across 5 technologies. Norman et al. Perturbation Data [2]: Provides single/double CRISPRa perturbations for K562 cells. Tabula Sapiens [1]: A multi-tissue, multi-donor reference atlas. |
| Linear Baselines | Simple models that serve as a critical sanity check. | Additive Model [2]: Baseline for perturbation prediction. No-Change Model [2]: Predicts control expression. HVG + PCA [1]: Baseline for clustering and integration. |
| Established Methods | Robust, non-foundation model algorithms for standard tasks. | scVI [1]: A generative model for data integration and analysis. Harmony [1]: An algorithm for integrating datasets across technologies. |
| Ontology-Informed Metrics | Metrics that incorporate prior biological knowledge. | scGraph-OntoRWR [28]: Evaluates biological consistency of cell relationships. LCAD (Lowest Common Ancestor Distance) [28]: Measures severity of cell type misclassification. |
| Benchmarking Frameworks | Standardized code for fair and reproducible evaluation. | PertEval-scFM [6]: A framework for benchmarking perturbation prediction. MLflow / Weights & Biases [30]: Tools for tracking experiments, parameters, and metrics. |
FAQ 1: What does it mean that simple models outperform foundation models for perturbation prediction? Recent benchmark studies have demonstrated that deliberately simple baseline models can match or even surpass sophisticated single-cell foundation models (scFMs) in predicting gene perturbation effects [2]. For example, a simple 'additive' model, which predicts double-knockout effects by summing the logarithmic fold changes of single knockouts, outperformed models like scGPT, Geneformer, and GEARS on held-out double perturbation data [2]. This highlights a significant challenge in the field: the goal of creating a generalizable model that provides a robust representation of cellular states for predicting experimental outcomes remains elusive [2].
FAQ 2: Why is zero-shot evaluation particularly important for scFMs? Zero-shot evaluation—assessing a model's performance without any task-specific fine-tuning—is critical for judging whether pretraining has endowed the model with a genuine, transferable understanding of biology [1]. This is especially important in single-cell biology, where many tasks are exploratory and lack predefined labels for fine-tuning [1]. Evaluations have revealed that in zero-shot settings, scFMs can be outperformed by simpler methods on tasks like cell type clustering and batch integration, exposing limitations that might be masked by a fine-tuning-based evaluation protocol [1].
FAQ 3: What are biology-informed evaluation metrics? Biology-informed metrics are designed to assess whether a model's outputs align with established biological knowledge. Moving beyond purely technical metrics, they evaluate biological plausibility. Examples include:
- scGraph-OntoRWR: quantifies the consistency between the cell-type relationships captured by a model's embeddings and the known relationships in established cell ontologies [4].
- Lowest Common Ancestor Distance (LCAD): measures the ontological proximity of misclassified cell types, penalizing confusion between distant cell types more heavily than confusion between closely related subtypes [4].
FAQ 4: My scFM performs well on technical metrics but yields biologically implausible results. What should I do? This discrepancy indicates a potential failure of the model to capture meaningful biological relationships, despite optimizing for technical benchmarks. You should:
- Benchmark against simple baselines and established methods to quantify the model's actual added value [2] [1].
- Apply biology-informed metrics such as scGraph-OntoRWR and LCAD to test whether the learned representations respect known cell-type relationships [4].
- Inspect the embeddings directly (e.g., UMAP plots colored by cell type and batch) to check whether the dominant structure is biological or technical [1].
Problem: Your model fails to accurately predict genetic interactions (e.g., in double-gene perturbations), often defaulting to predicting no interaction or showing high error rates.
Investigation & Resolution:
| Step | Action | Expected Outcome |
|---|---|---|
| 1 | Benchmark against an additive baseline. Compare your model's predictions on double perturbations to a simple model that sums the effects of the two single perturbations [2]. | The complex model should significantly outperform the simple additive baseline. If it does not, the foundation model is not capturing the interaction effect. |
| 2 | Classify the types of errors. Analyze whether the model is missing specific classes of genetic interactions, such as synergistic or opposite effects [2]. | The distribution of predicted interaction types (buffering, synergistic, opposite) should roughly match the validated ground truth data. |
| 3 | Check prediction variance. Examine if the model's predictions vary meaningfully across different perturbations or if they are consistently close to zero or the control condition [2]. | Predictions should show appropriate variance that reflects the biological changes in the data. |
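Step 2 above can be illustrated with a crude interaction classifier that compares the observed double-perturbation profile to the additive expectation. The projection heuristic and the 20% tolerance below are our own illustrative assumptions, not the classification rule used in [2]:

```python
import numpy as np

def classify_interaction(lfc_a, lfc_b, lfc_ab, tol=0.2):
    """Crude per-profile interaction call based on deviation from the
    additive expectation (illustrative thresholds only)."""
    expected = lfc_a + lfc_b
    # Project the observed double-perturbation profile onto the expectation
    scale = np.dot(lfc_ab, expected) / np.dot(expected, expected)
    if abs(scale - 1.0) <= tol:
        return "additive"
    if scale < 1.0 - tol:
        return "buffering"      # weaker than the additive expectation
    return "synergistic"        # stronger than the additive expectation

# Toy two-gene profiles (hypothetical values)
lfc_a = np.array([1.0, 0.0])
lfc_b = np.array([0.0, 1.0])
print(classify_interaction(lfc_a, lfc_b, np.array([2.0, 2.0])))   # stronger
print(classify_interaction(lfc_a, lfc_b, np.array([0.5, 0.5])))   # weaker
```

Tallying such calls for model predictions versus ground truth makes the reported bias toward "buffering" predictions directly visible [2].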
Problem: When using your scFM zero-shot (without fine-tuning) to generate cell embeddings, the resulting clusters do not separate known cell types effectively.
Investigation & Resolution:
| Step | Action | Expected Outcome |
|---|---|---|
| 1 | Compare to established baselines. Generate cell embeddings using highly variable genes (HVG), scVI, or Harmony and compare the clustering performance (e.g., using AvgBIO score or ASW) to your scFM's embeddings [1]. | The scFM's zero-shot embeddings should be competitive with or superior to embeddings from these established methods. |
| 2 | Quantify batch effect removal. Use batch integration metrics to check if the primary structure in the embeddings is driven by biological signal (cell type) or technical batch effects [1]. | Biological signal should explain more variance than batch effects in the embedding space. |
| 3 | Use ontology-based metrics. Apply biology-informed metrics like LCAD to understand if misclassifications are at least biologically "close" (e.g., confusing two T-cell subtypes) or "distant" (e.g., confusing a T-cell with a neuron) [4]. | Misclassifications should have a low LCAD, meaning they are between biologically similar cell types. |
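Step 2 above can be approximated with a simple between-group variance ratio. This is a rough stand-in for dedicated batch-integration metrics (e.g., those in the scIB suite), and the data below are synthetic:

```python
import numpy as np

def variance_explained(X, labels):
    """Fraction of total embedding variance explained by group labels
    (between-group sum of squares / total sum of squares)."""
    grand_mean = X.mean(axis=0)
    total = ((X - grand_mean) ** 2).sum()
    between = 0.0
    for g in set(labels):
        mask = labels == g
        between += mask.sum() * ((X[mask].mean(axis=0) - grand_mean) ** 2).sum()
    return between / total

rng = np.random.default_rng(2)
# Hypothetical embedding where cell type drives structure and batch does not
cell_type = np.array(["A"] * 60 + ["B"] * 60)
batch = np.tile(np.array(["b1", "b2"]), 60)
emb = rng.normal(size=(120, 10))
emb[cell_type == "B"] += 4.0   # strong cell-type signal, no batch signal

print(variance_explained(emb, cell_type) > variance_explained(emb, batch))
```

A well-integrated embedding should show substantially more variance explained by cell type than by batch; the reverse pattern indicates the model has failed at integration [1].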
Problem: The model performs poorly when predicting the effects of perturbing a gene that was not included in its training data.
Investigation & Resolution:
| Step | Action | Expected Outcome |
|---|---|---|
| 1 | Test a linear model with embeddings. Extract the gene and perturbation embeddings learned by the scFM during pre-training. Use them in a simple linear model (e.g., Eq. 1 from [2]) to predict perturbation effects. | The performance of the linear model using the scFM's embeddings indicates the quality of the representations learned during pre-training. |
| 2 | Compare to a "mean" prediction baseline. A very strong baseline is to simply predict the average expression across the training set for any unseen perturbation [2]. | Your model must significantly outperform this naive baseline to be considered useful. |
| 3 | Leverage external perturbation data. If available, pre-training on a separate, large-scale perturbation dataset (even from a different cell line) can create better perturbation embeddings that improve generalizability [2]. | Pre-training on diverse perturbation data should lead to better performance on new perturbations compared to pre-training only on atlas data. |
The following tables summarize key quantitative findings from recent benchmark studies, providing a reference for expected performance.
Table 1: Benchmarking Perturbation Prediction Performance. This table summarizes results from a study comparing several models and baselines on their ability to predict transcriptome changes after genetic perturbations [2].
| Model / Baseline | Performance on Double Perturbations | Performance on Unseen Single Perturbations |
|---|---|---|
| Additive Model | Outperformed all deep learning models [2] | Not Applicable |
| No Change Model | Competitive with deep learning models [2] | Not Applicable |
| Deep Learning Models (e.g., GEARS, scGPT) | Underperformed compared to the simple additive baseline [2] | Did not consistently outperform a simple linear model or mean prediction baseline [2] |
| Linear Model with Pre-trained Embeddings | Not Reported | Performance was competitive with or better than the original deep learning models [2] |
Table 2: Zero-Shot Performance on Cell-Level Tasks. This table summarizes the performance of scFMs against established baselines in a zero-shot setting, where models are not fine-tuned on the target data [1]. Performance is ranked from best (1) to worst (4).
| Method | Cell Type Clustering (AvgBIO Score) | Batch Integration (Batch Mixing Score) |
|---|---|---|
| Highly Variable Genes (HVG) | 1 [1] | 1 [1] |
| scVI | 2 [1] | 2 [1] |
| Harmony | 3 [1] | 3 [1] |
| scGPT / Geneformer | 4 [1] | 4 [1] |
This protocol outlines the steps to evaluate a model's ability to predict non-additive effects in double-gene perturbations, as performed in [2].
Methodology:
This protocol describes how to evaluate the biological relevance of cell embeddings generated by an scFM without any fine-tuning [4] [1].
Methodology:
The following diagram illustrates a comprehensive biology-informed evaluation workflow for a single-cell foundation model, integrating both technical and biological validation steps.
Table 3: Key Computational Tools and Datasets for Evaluation. This table lists essential resources for conducting a rigorous evaluation of single-cell foundation models.
| Item Name | Type | Function / Explanation |
|---|---|---|
| Norman et al. Data | Dataset | A key dataset for benchmarking perturbation prediction, containing transcriptome profiles for 100 single-gene and 124 double-gene perturbations in K562 cells [2]. |
| Additive Baseline Model | Computational Baseline | A simple model that predicts the effect of a double perturbation by summing the logarithmic fold changes of the two single perturbations. Crucial for benchmarking complex models [2]. |
| scGraph-OntoRWR | Evaluation Metric | A novel metric that quantifies the consistency between cell-type relationships learned by the model and the known relationships in a cell ontology, providing a biology-informed performance measure [4]. |
| Cell Ontology | Knowledge Base | A structured, controlled vocabulary for cell types. Serves as the source of prior biological knowledge for metrics like scGraph-OntoRWR and LCAD [4]. |
| Replogle & Adamson Data | Dataset | Large-scale single-cell CRISPRi perturbation datasets (in K562 and RPE1 cells) used for benchmarking a model's ability to generalize to unseen perturbations [2]. |
Current benchmarks reveal that single-cell foundation models (scFMs) often fail to outperform simpler, traditional methods for predicting transcriptional responses to genetic perturbations [2] [6]. While scFMs are powerful tools for integrating diverse datasets and exploring biological systems, their performance varies significantly across tasks, with no single model consistently dominating others [4].
Table 1: scFM Performance Summary on Perturbation Tasks
| Model | Reported Performance on Perturbation Tasks | Key Limitations Identified |
|---|---|---|
| scGPT | Does not consistently outperform simple additive or linear baselines [2]. | Struggles with predicting strong or atypical perturbation effects; predictions show insufficient variance [2]. |
| scFoundation | Claimed capability for perturbation prediction [2]. | Requires specific gene sets, limiting application to datasets with missing genes; predictions vary less than ground truth [2]. |
| Geneformer | Can be repurposed for perturbation prediction, though this was not its primary design goal [2]. | Underperforms simpler methods in zero-shot settings such as batch integration and cell type clustering [1]. |
| UCE & scBERT | Can be repurposed for perturbation tasks via a linear decoder [2]. | Not originally designed for perturbation prediction, leading to suboptimal performance [2]. |
| General scFMs | Zero-shot embeddings do not provide consistent improvement over baselines, especially under distribution shift [6]. | All models struggle with predicting genetic interactions (synergistic/opposite) and often default to predicting "buffering" effects [2]. |
A critical finding from recent studies is that the goal of creating a generalizable representation for predicting the outcome of novel experiments remains largely elusive [2]. Furthermore, in zero-shot settings—where models are used without any task-specific fine-tuning—scFMs can show significant reliability challenges and be outperformed by simpler methods [1].
Zero-shot evaluation is critical for assessing whether pretraining has endowed the model with a true, transferable understanding of biology, which is essential for exploratory research where labels are unknown or fine-tuning is not feasible [1]. In the context of perturbation research, this capability is paramount for in silico prediction of experiments that have not yet been conducted, a key promise of foundation models.
However, evidence suggests that the masked language model pretraining framework used by many scFMs may not inherently produce cell embeddings that are useful for zero-shot perturbation prediction [1] [2]. This represents a significant limitation for researchers who need to apply models to novel diseases, uncharacterized cell types, or unprecedented combinatorial perturbations where no training data exists.
Benchmarking scFMs for perturbation effect prediction involves evaluating their ability to predict gene expression changes after single or double genetic perturbations. The following workflow outlines a standardized protocol adapted from recent rigorous benchmarks [2].
Detailed Protocol:
Table 2: Essential Research Reagents & Solutions for scFM Perturbation Analysis
| Item Name | Function/Description | Example Source/Identifier |
|---|---|---|
| Perturbation Datasets | Provides ground-truth transcriptome data for training and benchmarking models. | Norman et al. data (GEO); Replogle et al. K562/RPE1 CRISPRi data [2]. |
| Reference Datasets | Used for evaluating zero-shot capabilities in tasks like cell type annotation and batch integration. | Pancreas dataset; Tabula Sapiens; PBMC (12k) dataset [4] [1]. |
| Benchmarking Framework | Standardized codebase to ensure fair and reproducible model comparisons. | PertEval-scFM framework [6]. |
| Baseline Models | Simple, non-foundation model benchmarks (e.g., HVG selection, linear models) essential for performance context. | Highly Variable Genes (HVG); Harmony; scVI; Simple Additive Model [1] [2]. |
| Compute Resources | High-performance computing (HPC) or cloud resources are necessary for training and fine-tuning large models. | GPUs (e.g., NVIDIA A100/H100) for efficient training [4]. |
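The scoring step of this benchmarking protocol can be sketched as follows. The data here are synthetic and the "model" is a deliberately under-dispersed stand-in; the key design choice is scoring the Pearson correlation of the *deltas* (change from control) rather than of raw expression, since raw-expression correlation is trivially inflated by predicting the control mean:

```python
import numpy as np

def score_prediction(pred, truth, control):
    """Score a predicted post-perturbation profile against ground truth.
    Reports MSE and the Pearson correlation of the deltas (change from
    control), which separates real signal from 'predict the mean'."""
    mse = float(np.mean((pred - truth) ** 2))
    pd, td = pred - control, truth - control
    if pd.std() == 0 or td.std() == 0:
        r = 0.0  # delta correlation is undefined for a constant prediction
    else:
        r = float(np.corrcoef(pd, td)[0, 1])
    return {"mse": mse, "pearson_delta": r}

# Toy run: 200 genes, one perturbation, two "models" to score.
rng = np.random.default_rng(1)
control = rng.normal(size=200)                        # mean control expression
truth = control + rng.normal(scale=0.5, size=200)     # true perturbed profile
model_pred = control + 0.1 * (truth - control) \
             + rng.normal(scale=0.1, size=200)        # under-dispersed model
no_change = control.copy()                            # 'no change' baseline

results = {"model": score_prediction(model_pred, truth, control),
           "no-change baseline": score_prediction(no_change, truth, control)}
for name, res in results.items():
    print(name, res)
```

In a real benchmark, `model_pred` would be the output of scGPT, GEARS, or the additive baseline on held-out perturbations from the Norman or Replogle datasets.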
This is a common issue reported in benchmarks, where models like scGPT and GEARS predict expression changes that vary considerably less than the ground-truth data and largely fail to capture correct synergistic interactions [2].
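A quick diagnostic for this failure mode is to compare the variance of predicted perturbation effects against the variance of the observed effects. The sketch below uses simulated deltas that shrink every effect toward zero, mimicking the reported behavior:

```python
import numpy as np

def dispersion_ratio(pred_deltas, true_deltas):
    """Variance of predicted perturbation effects relative to observed
    effects; values well below 1 mean the model is hedging toward the mean."""
    return float(np.var(pred_deltas) / np.var(true_deltas))

# Simulated predictions that shrink every effect toward zero.
rng = np.random.default_rng(2)
true_deltas = rng.normal(scale=1.0, size=(50, 100))  # 50 perturbations x 100 genes
pred_deltas = 0.3 * true_deltas + rng.normal(scale=0.1, size=(50, 100))

print(f"dispersion ratio: {dispersion_ratio(pred_deltas, true_deltas):.2f}")
```

A ratio far below 1 on your own data is a strong hint that the model's apparent accuracy comes from regression to the mean rather than from capturing perturbation-specific effects.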
Potential Causes and Solutions:
The decision should be guided by a clear understanding of your project's constraints and goals. The following diagram can help guide this decision.
Decision Guide:
The current limitations highlight a need for more specialized model architectures and higher-quality, broader perturbation datasets [6]. Future directions may include:
No. Comprehensive benchmarks conclude that no single scFM consistently outperforms all others across diverse application scenarios [4]. Model performance is highly dependent on the specific task (e.g., batch integration vs. perturbation prediction), dataset size, and biological context.
The most effective strategy is not to search for a single "best" model but to adopt a benchmarking-driven approach. For your specific dataset and perturbation task, run a focused benchmark comparing several promising scFMs (e.g., scGPT, Geneformer) against the simple baselines described in the experimental protocols above. This is the only way to make a data-driven selection tailored to your research needs [4] [2].
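This benchmarking-driven selection can be organized as a minimal harness. The candidate entries below are hypothetical stand-ins; in practice each would be the predicted profile produced by wrapping scGPT, Geneformer, or an HVG/additive baseline on your held-out perturbations:

```python
import numpy as np

# Hypothetical selection harness: score every candidate on held-out data
# and keep the winner. The "candidate model" here is a synthetic stand-in.
rng = np.random.default_rng(3)
control = rng.normal(size=500)                      # 500-gene control profile
truth = control + rng.normal(scale=0.4, size=500)   # held-out perturbed profile

candidates = {
    "no-change baseline": control,
    "candidate model": control + 0.5 * (truth - control),  # partially correct
}

scores = {name: float(np.mean((pred - truth) ** 2))
          for name, pred in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("selected:", best)
```

The point of the harness is not the metric (swap in Pearson delta or top-DE-gene recall as appropriate) but the discipline: every scFM is scored on the same held-out data as the simple baselines before it is adopted.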
The current generation of single-cell foundation models represents a significant step forward in computational biology, yet our analysis underscores that their zero-shot application to perturbation prediction remains fraught with challenges. The key takeaway is that these models, in their raw pretrained state, are not a magic bullet; they often cannot surpass simple baselines and require careful handling. However, the path forward is clear. Success hinges on moving beyond a pure zero-shot paradigm through strategic fine-tuning, incorporating structured biological knowledge, and adhering to rigorous, standardized benchmarking. Future progress will depend on developing more specialized models trained on higher-quality, broader perturbation datasets and creating evaluation frameworks that prioritize biologically meaningful insights over purely technical metrics. For biomedical research, this critical evolution will be essential to truly leverage scFMs for accelerating drug discovery and unraveling complex disease mechanisms.