In Silico Perturbation Modeling with Single-Cell Foundation Models: A New Frontier for Predictive Biology

Mason Cooper · Nov 27, 2025

Abstract

In silico perturbation modeling using single-cell foundation models (scFMs) promises to revolutionize biological discovery and therapeutic development by predicting cellular responses to genetic and chemical interventions. This article explores the foundational concepts of scFMs, their architectural principles, and their application in predicting perturbation effects. It critically examines current methodological approaches, including the emerging 'closed-loop' fine-tuning paradigm, which significantly enhances predictive accuracy by iteratively incorporating experimental data. Furthermore, the article addresses the pressing challenges and limitations highlighted by recent rigorous benchmarks, which show that current models often struggle to outperform simple linear baselines. Finally, it provides a comprehensive overview of the validation landscape, synthesizing insights from multiple benchmarking studies to guide researchers in evaluating model performance and to outline a path forward for realizing the full potential of virtual cell models in biomedical research.

Demystifying Single-Cell Foundation Models: From Core Concepts to Architectural Principles

What Are Foundation Models? Defining the Self-Supervised Learning Paradigm for Biology

Foundation models represent a revolutionary class of artificial intelligence systems trained on vast datasets using self-supervised learning objectives, enabling them to develop generalized representations that can be adapted to diverse downstream tasks without task-specific training [1]. In biology, these models are transforming how researchers analyze complex biological systems by learning fundamental patterns from massive, unlabeled datasets including genomic sequences, single-cell transcriptomes, and protein structures [2]. The core innovation of foundation models lies in their self-supervised pretraining phase, where models learn to predict masked or contextually relevant elements within their input data, thereby capturing deep biological relationships without human-provided labels [1] [3].

The application of foundation models to biological data represents a paradigm shift from traditional supervised approaches, which require extensive labeled datasets that are often expensive and time-consuming to create [3]. Instead, biological foundation models leverage the enormous quantities of unlabeled data being generated by modern high-throughput technologies, from single-cell sequencing platforms to genomic databases [1] [2]. This approach has proven particularly powerful in biological domains where labeled data is scarce but unlabeled data is abundant, enabling models to learn the fundamental "language" of biology—whether that be the grammar of gene regulation, the syntax of protein folding, or the vocabulary of cellular states [1].

Table: Key Characteristics of Biological Foundation Models

Characteristic | Description | Biological Examples
Self-Supervised Pretraining | Models learn by predicting masked portions of input data without human labeling | Predicting masked genes in single-cell data [1]
Transfer Learning | Pretrained models adapt to new tasks with minimal additional training | Geneformer fine-tuned for disease-specific predictions [2]
Scalability | Models trained on millions to billions of data points | scGPT trained on ~30 million cells [2]
Multi-task Capability | Single model handles diverse prediction tasks | LPM predicts perturbation effects and identifies mechanisms [4]

Core Concepts: Self-Supervised Learning in Biological Contexts

Self-supervised learning (SSL) represents the foundational training paradigm that enables foundation models to learn meaningful representations from unlabeled biological data [3]. In biological contexts, SSL methods create training signals directly from the data itself by designing pretext tasks that require the model to learn intrinsic patterns and relationships [3]. For genomic sequences, this might involve predicting missing nucleotides or reverse-complement sequences; for single-cell data, this typically means predicting masked gene expressions based on the context of other genes within the same cell [1] [3].

The transformer architecture has emerged as the dominant backbone for biological foundation models due to its ability to capture long-range dependencies and complex relationships within sequential data [1]. In single-cell biology, transformers process gene expression profiles by treating individual genes as "tokens" analogous to words in a sentence, allowing the model to learn how genes co-express and regulate one another across diverse cellular contexts [1]. Models like scBERT and Geneformer employ bidirectional attention mechanisms that consider all genes simultaneously, enabling comprehensive understanding of gene-gene interactions [1] [5]. Alternatively, decoder-based models like scGPT use autoregressive approaches that predict gene expressions sequentially, similar to how language models generate text [1].

[Figure: Self-supervised learning in biological foundation models. Raw biological data (genomes, transcriptomes, etc.) is tokenized and used in a self-supervised pretext task (e.g., masked gene prediction or reverse-complement sequence prediction), which trains a transformer architecture. The learned representations (gene/protein embeddings, cell state representations) then support downstream tasks such as cell type annotation, perturbation prediction, and drug response modeling.]

Tokenization strategies form a critical component of biological foundation models, determining how raw biological data is transformed into model-processable units [1]. For single-cell data, this involves converting gene expression profiles into discrete tokens, typically by binning expression values or ranking genes by expression level within each cell [1]. A key challenge in this process is that gene expression data lacks natural sequential ordering—unlike words in a sentence—requiring researchers to impose artificial orderings based on expression magnitude or other criteria [1]. Advanced tokenization approaches may incorporate additional biological context, such as gene ontology terms or chromosomal locations, to enrich the input representations [1].
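
As an illustration, rank-based tokenization can be sketched in a few lines. The gene symbols, expression values, and vocabulary below are invented for the example; real models such as Geneformer operate on thousands of genes with normalized expression values.

```python
# Sketch of rank-based tokenization: order a cell's genes by descending
# expression and map them to integer token ids. All names and values here
# are illustrative, not from any real model vocabulary.

def rank_tokenize(expression, vocab, max_len=4):
    """expression: dict of gene symbol -> normalized expression value.
    vocab:      dict of gene symbol -> integer token id.
    Zero-expression genes are dropped, mirroring the common practice of
    tokenizing only detected genes."""
    expressed = [(g, v) for g, v in expression.items() if v > 0]
    # Highest-expressed genes first; ties broken alphabetically for determinism.
    ranked = sorted(expressed, key=lambda gv: (-gv[1], gv[0]))
    return [vocab[g] for g, _ in ranked[:max_len]]

vocab = {"ACTB": 1, "GAPDH": 2, "CD19": 3, "MS4A1": 4}
cell = {"ACTB": 35.0, "GAPDH": 20.0, "CD19": 5.5, "MS4A1": 0.0}
tokens = rank_tokenize(cell, vocab)
print(tokens)  # [1, 2, 3]
```

The resulting token sequence imposes an artificial order (by expression magnitude) on inherently unordered gene expression data, which is exactly the workaround described above.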

Application to In Silico Perturbation Modeling with scFMs

Single-cell foundation models (scFMs) have emerged as powerful tools for in silico perturbation modeling, enabling researchers to simulate cellular responses to genetic and chemical perturbations without conducting expensive wet-lab experiments [4] [6]. These models learn the fundamental principles of cellular organization from large-scale single-cell atlases, capturing how gene networks interact and respond to disturbances [1]. When applied to perturbation modeling, scFMs can predict transcriptomic changes resulting from gene knockouts, drug treatments, or other interventions, significantly accelerating biological discovery and drug development [4].

The Large Perturbation Model (LPM) represents a cutting-edge approach that specifically addresses the challenges of integrating heterogeneous perturbation data across different experimental contexts, readout modalities, and perturbation types [4]. LPM employs a disentangled architecture that separately represents perturbations (P), readouts (R), and contexts (C) as distinct dimensions, enabling the model to learn generalizable perturbation-response rules that transfer across biological settings [4]. This approach has demonstrated superior performance in predicting post-perturbation transcriptomes compared to existing methods, while also enabling the identification of shared molecular mechanisms between chemical and genetic perturbations [4].

Table: Performance Comparison of Perturbation Modeling Approaches

Method | Architecture | Perturbation Types Supported | Prediction Accuracy (Pearson R) | Key Applications
LPM [4] | PRC-disentangled transformer | Genetic, chemical, multi-omics | 0.72-0.89 (across contexts) | Mechanism identification, drug-target mapping
Geneformer [4] [2] | Transformer encoder | Genetic | 0.61-0.75 | Network dynamics, disease modeling
scGPT [4] [5] | GPT-style decoder | Genetic, chemical | 0.65-0.81 | Cell annotation, multi-omic integration
CPA [4] | Autoencoder | Chemical, combinations | 0.58-0.72 | Drug combination prediction
GEARS [4] | Graph-enhanced MLP | Genetic | 0.63-0.78 | Genetic interaction mapping

In pharmaceutical research, perturbation models are increasingly used to identify novel therapeutic applications for existing compounds and to understand their mechanisms of action [4] [2]. For example, LPM has demonstrated the ability to cluster pharmacological inhibitors with genetic perturbations targeting the same genes, effectively mapping compound-CRISPR relationships in a unified latent space [4]. This approach identified the anti-inflammatory properties of pravastatin, which clustered near non-steroidal anti-inflammatory drugs in the perturbation space—a finding corroborated by clinical observations [4]. Similarly, scGPT-enabled analysis of tumor-associated macrophages identified C5aR1 gene expression as a key modulator of PARP inhibitor resistance in breast cancer models, suggesting promising therapeutic targets [2].

Experimental Protocols for In Silico Perturbation Modeling

Protocol 1: Setting Up the Computational Environment

Objective: Establish a standardized environment for scFM-based perturbation analysis using containerized solutions to ensure reproducibility across research teams [5].

Materials:

  • Computing resources: GPU cluster with ≥16GB VRAM (e.g., NVIDIA A100 or V100)
  • Container platform: Docker or Singularity
  • BioLLM framework [5]
  • Pretrained scFM weights (scGPT, Geneformer, or LPM)

Procedure:

  • Environment Configuration:
    • Create a Dockerfile based on PyTorch 2.0+ and Python 3.9+
    • Install BioLLM using: pip install biollm
    • Verify CUDA compatibility and flash attention support
  • Data Preprocessing:

    • Implement quality control using Scanpy or Seurat
    • Filter cells with mitochondrial gene percentage >20%
    • Remove genes expressed in <10 cells
    • Normalize using SCTransform or Scanpy's pp.normalize_total
  • Model Initialization:

    • Load pretrained weights through BioLLM's standardized API
    • Configure tokenization parameters matching the pretraining setup
    • Set attention mechanisms and hidden dimensions according to model specifications
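
The quality-control thresholds above can be illustrated on a toy dense matrix. This is a minimal pure-Python sketch of the filtering logic only; a real pipeline would apply the same thresholds with Scanpy (sc.pp.filter_cells, sc.pp.filter_genes) or Seurat on sparse data.

```python
# Toy illustration of the QC thresholds in Protocol 1 on a dense
# cell x gene matrix (rows = cells, columns = genes). The matrix and
# gene names are invented; min_cells is lowered for the tiny example.

def qc_filter(matrix, gene_names, mito_prefix="MT-", max_mito_frac=0.20, min_cells=10):
    mito_idx = [j for j, g in enumerate(gene_names) if g.startswith(mito_prefix)]
    # 1. Drop cells whose mitochondrial counts exceed 20% of total counts.
    kept_cells = []
    for row in matrix:
        total = sum(row)
        mito = sum(row[j] for j in mito_idx)
        if total > 0 and mito / total <= max_mito_frac:
            kept_cells.append(row)
    # 2. Drop genes detected in fewer than `min_cells` remaining cells.
    kept_idx = [
        j for j in range(len(gene_names))
        if sum(1 for row in kept_cells if row[j] > 0) >= min_cells
    ]
    filtered = [[row[j] for j in kept_idx] for row in kept_cells]
    return filtered, [gene_names[j] for j in kept_idx]

genes = ["MT-CO1", "ACTB", "CD19"]
matrix = [[1, 9, 0],   # 10% mitochondrial -> kept
          [8, 2, 0],   # 80% mitochondrial -> dropped
          [0, 5, 5],
          [0, 4, 5]]
filtered, kept = qc_filter(matrix, genes, min_cells=2)
print(kept)  # ['ACTB', 'CD19']
```
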

Protocol 2: Performing In Silico Perturbations with LPM

Objective: Simulate transcriptional responses to genetic and chemical perturbations using the Large Perturbation Model architecture [4].

Materials:

  • LPM implementation (available from original publication)
  • Perturbation database (e.g., LINCS, DepMap)
  • Reference single-cell dataset (e.g., CELLxGENE census)

Procedure:

  • Data Integration:
    • Format perturbation data as (P, R, C) tuples
    • Align gene identifiers across datasets using HGNC symbols
    • Batch correct using Harmony or SCVI if multiple datasets are combined
  • Model Inference:

    • Input desired perturbation (e.g., "KRAS knockout" or "doxorubicin treatment")
    • Specify biological context (e.g., "A549 lung cancer cells")
    • Define readout parameters (e.g., "transcriptome 48h post-perturbation")
    • Execute forward pass through LPM to obtain predicted expression profile
  • Result Interpretation:

    • Calculate differential expression compared to unperturbed control
    • Perform pathway enrichment analysis using GO, KEGG, or Reactome
    • Compare predicted expression changes to empirical data when available

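The differential-expression step in Result Interpretation reduces to per-gene log fold changes against the unperturbed control. A minimal sketch (gene names and expression values are invented; a pseudocount guards against division by zero):

```python
# Log2 fold changes of a predicted perturbed profile against an
# unperturbed control. Values are illustrative only.
import math

def log2_fold_changes(perturbed, control, pseudocount=1.0):
    return {
        gene: math.log2((perturbed[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
    }

control = {"KRAS": 15.0, "MYC": 7.0, "CDKN1A": 1.0}
predicted = {"KRAS": 0.0, "MYC": 3.0, "CDKN1A": 7.0}  # hypothetical KRAS-knockout prediction
lfc = log2_fold_changes(predicted, control)
print(lfc["KRAS"])  # -4.0, i.e. log2(1/16)
```

The resulting fold-change dictionary is what would feed into the pathway enrichment step that follows.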
[Figure: In silico perturbation modeling workflow with scFMs. Define the perturbation (gene KO, compound, dose) → select the biological context (cell type, tissue, disease) → configure the scFM/LPM (load pretrained weights) → execute the in silico experiment (model forward pass) → analyze the predicted output (differential expression) → validate and interpret (pathway analysis, experimental correlation).]

Protocol 3: Cross-Model Benchmarking with BioLLM

Objective: Compare performance across different scFMs for perturbation prediction tasks using standardized evaluation metrics [5].

Materials:

  • BioLLM framework installation [5]
  • Benchmark dataset with empirical perturbation responses
  • Evaluation metrics suite (silhouette scores, RMSE, Pearson correlation)

Procedure:

  • Dataset Preparation:
    • Curate ground truth perturbation dataset (e.g., CRISPR screens with scRNA-seq readouts)
    • Split data into training/validation/test sets (70/15/15%)
    • Implement k-fold cross-validation with 5 folds
  • Model Comparison:

    • Initialize multiple scFMs (scGPT, Geneformer, scBERT) through BioLLM APIs
    • Fine-tune each model on identical training data
    • Generate predictions for held-out test perturbations
    • Calculate metrics: Pearson R (gene-level), ASW (cell embedding), RMSE (expression)
  • Results Analysis:

    • Perform paired t-tests between model performances
    • Visualize using UMAP/t-SNE for embedding quality assessment
    • Identify model strengths by perturbation type and cellular context
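
The metric calculations in Protocol 3 can be sketched in a few lines of pure Python; in practice one would use scipy.stats.pearsonr and an established RMSE implementation, but the definitions are simple enough to state directly:

```python
# Gene-level evaluation metrics for predicted vs. observed expression
# changes: Pearson correlation and root-mean-square error. The example
# vectors are invented.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

observed = [2.0, -1.0, 0.5, -0.5]
predicted = [1.8, -0.9, 0.7, -0.6]
print(round(pearson_r(observed, predicted), 3))
print(round(rmse(observed, predicted), 3))
```
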

Table: Research Reagent Solutions for scFM Perturbation Modeling

Resource | Type | Function | Example Implementation
BioLLM [5] | Software Framework | Standardized API for multiple scFMs | Unified interface for scGPT, Geneformer, scBERT
CELLxGENE [1] | Data Repository | Curated single-cell datasets | >100 million standardized cells for model training
LPM [4] | Specialized Model | Multi-modal perturbation prediction | PRC-disentangled architecture for cross-context prediction
scvi-tools [2] | Analysis Suite | Probabilistic modeling of single-cell data | Differential expression, dimensionality reduction
TabPFN [7] | Tabular Foundation Model | Small-sample tabular predictions | Bayesian inference for experimental design
Self-GenomeNet [3] | SSL Method | Genomic sequence representation | Reverse-complement aware pre-training

The integration of foundation models into biological research requires both computational resources and specialized knowledge. For researchers beginning with scFMs, starting with user-friendly frameworks like BioLLM provides immediate access to multiple models through standardized APIs, eliminating architectural inconsistencies and simplifying benchmarking [5]. When designing perturbation studies, careful consideration of model selection is crucial—encoder-based models like Geneformer excel at gene-level tasks and network inference, while decoder-based models like scGPT demonstrate stronger performance in cell-level predictions and batch effect correction [5].

Data quality remains paramount for successful perturbation modeling. Researchers should prioritize datasets with appropriate controls, sufficient replication, and minimal technical artifacts. For novel therapeutic applications, integration across multiple evidence streams—including foundation model predictions, electronic health records, and experimental validation—creates a compelling case for candidate targets [4] [2]. As these technologies mature, the scientific community is developing standards for reporting model predictions and establishing benchmarks for methodological comparisons, further accelerating the adoption of foundation models in biological discovery and drug development.

Single-cell foundation models (scFMs) represent a transformative approach in computational biology, drawing direct inspiration from large language models (LLMs) in natural language processing (NLP). The core concept involves reframing cellular biology as a linguistic system, where individual cells are treated as "sentences" and the genes within them as "words" or "tokens" [8]. This analogy allows researchers to apply the powerful transformer architecture, which has revolutionized machine understanding of human language, to decipher the complex "language" of cellular function and state [8]. This paradigm shift is particularly impactful for in silico perturbation modeling, where the goal is to predict how targeted genetic interventions might alter cellular states, potentially accelerating therapeutic discovery [9].

Foundational Concepts: From Biological Data to Linguistic Units

The Tokenization Process: Converting Gene Expression to Tokens

Tokenization is the critical first step that converts raw, non-sequential gene expression data into a structured format that transformer models can process. Unlike words in a sentence, genes have no inherent order, requiring scFMs to implement specific strategies to impose sequence [8].

  • Gene Identity Tokens: Each gene is represented by a unique identifier token, analogous to a word in a dictionary. These tokens are typically converted into dense vector representations (embeddings) [8] [10].
  • Expression Value Encoding: A gene's expression level in a given cell must be incorporated alongside its identity. Common methods include:
    • Value Binning: Discretizing continuous expression values into categorical bins (e.g., low, medium, high) [10].
    • Value Projection: Using a neural network layer to project the continuous value into an embedding vector [10].
    • Rank-based Ordering: Ranking genes by their expression level within a cell and using this order to create the sequence [8] [10].
  • Special Tokens: Models often include additional tokens to provide context, such as [CELL] tokens to represent cell-level information, or modality indicators for multi-omics data [8].
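
A minimal sketch of value binning, using equal-width bins over a cell's observed range. Real models such as scGPT typically bin nonzero values per cell with more careful schemes; the thresholds here are simplified for illustration.

```python
# Discretize continuous expression values into a fixed number of
# equal-width bins (e.g., n_bins=3 for low/medium/high).

def bin_expression(values, n_bins=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant input
    bins = []
    for v in values:
        b = int((v - lo) / width)
        bins.append(min(b, n_bins - 1))  # clamp the maximum into the top bin
    return bins

print(bin_expression([0.0, 1.0, 5.0, 9.0]))  # [0, 0, 1, 2]
```

The bin index then serves as the categorical expression token paired with the gene identity token.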

Model Architecture: The Transformer Backbone

Most scFMs are built on the transformer architecture, which uses self-attention mechanisms to weigh the importance of all genes (tokens) when processing the information of each individual gene [8]. Two primary architectural variants are employed:

  • Encoder-based Models (e.g., scBERT, Geneformer): These use a bidirectional attention mechanism, meaning each gene token can attend to all other genes in the cell simultaneously. This is well-suited for classification and embedding tasks [8].
  • Decoder-based Models (e.g., scGPT): These use a unidirectional (masked) attention mechanism, where each token can only attend to previous tokens in the sequence. This architecture is often used for generative tasks, such as predicting masked genes [8].
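
The difference between these two attention patterns comes down to the attention mask. A minimal sketch, with 1 marking that the token at row position i may attend to the token at column position j:

```python
# Bidirectional (encoder) mask: every token attends to every other token.
def bidirectional_mask(n):
    return [[1] * n for _ in range(n)]

# Causal (decoder) mask: each token attends only to itself and earlier positions.
def causal_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

print(causal_mask(3))  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```

In a real transformer these masks are applied to the attention score matrix before the softmax, setting disallowed positions to negative infinity.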

Table 1: Overview of Prominent Single-Cell Foundation Models

Model Name | Architecture Type | Primary Pretraining Task | Input Gene Count | Key Differentiating Feature
Geneformer [10] | Encoder | Masked Gene Modeling (Gene ID prediction) | 2,048 (ranked) | Uses ranked gene lists; lookup table for gene embeddings.
scGPT [10] | Encoder (with masking) | Iterative Masked Gene Modeling (Value prediction) | ~1,200 (HVGs) | Value binning; multi-omics capability; generative pretraining.
scFoundation [10] | Asymmetric Encoder-Decoder | Read-depth-aware Masked Gene Modeling | ~19,000 | Uses nearly the full transcriptome; value projection.
UCE [10] | Encoder | Binary prediction of gene expression | 1,024 (genomic position) | Uses protein sequence embeddings from ESM-2.

The diagram below illustrates the core workflow of how a single cell's data flows through a typical scFM based on the transformer architecture.

[Figure: A single cell's gene expression profile passes through tokenization and input embedding, yielding gene tokens 1...N, which the transformer encoder layers convert into a latent representation (cell embedding).]

Figure 1: From Cell to Embedding: The Core scFM Workflow. This diagram visualizes the process of converting a cell's gene expression profile into a unified latent representation via tokenization and transformer layers.

Protocols for In Silico Perturbation Modeling

In silico perturbation (ISP) is a premier application of scFMs, enabling the prediction of a cell's transcriptional state after a hypothetical genetic manipulation (e.g., gene knockout or overexpression).

Protocol 1: Open-Loop In Silico Perturbation

This is the baseline method for predicting perturbation effects without incorporating prior experimental perturbation data into the model fine-tuning [9].

  • Model Fine-Tuning for State Classification:

    • Input: A dataset of single-cell RNA sequencing (scRNA-seq) data from two biological states (e.g., healthy vs. diseased, resting vs. activated T-cells) [9].
    • Process: A pretrained scFM (e.g., Geneformer) is fine-tuned on this dataset to accurately classify a cell's state. This teaches the model the transcriptomic features distinguishing the states.
    • Output: A fine-tuned model whose latent space is structured to separate the two states.
  • Perturbation Simulation and Prediction:

    • Input: A query cell from the "diseased" (or target) population.
    • Process: The fine-tuned model is prompted to simulate the effect of perturbing a specific gene (e.g., setting its expression to zero for a knockout). The model generates a predicted expression profile for the perturbed cell [9].
    • Analysis: The predicted profile is projected into the model's latent space. The direction and magnitude of shift from the original "diseased" state towards the "healthy" state is quantified. A significant shift suggests the perturbation could be therapeutic [9].
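
The shift quantification in the Analysis step can be sketched as a scalar projection onto the diseased-to-healthy axis in latent space. The 2D embeddings and centroids below are invented for illustration; real scFM latent spaces have hundreds of dimensions.

```python
# Score a perturbed cell's predicted embedding by the fraction of the
# diseased->healthy distance it covers along the disease axis.

def shift_score(original, perturbed, diseased_centroid, healthy_centroid):
    axis = [h - d for h, d in zip(healthy_centroid, diseased_centroid)]
    axis_len_sq = sum(a * a for a in axis)
    delta = [p - o for p, o in zip(perturbed, original)]
    # Scalar projection of the perturbation-induced shift onto the axis.
    return sum(d * a for d, a in zip(delta, axis)) / axis_len_sq

diseased = [0.0, 0.0]
healthy = [4.0, 0.0]
cell_before = [0.5, 0.2]
cell_after = [2.5, 0.1]   # hypothetical post-knockout embedding
print(shift_score(cell_before, cell_after, diseased, healthy))  # 0.5
```

A score near 1 would indicate the simulated knockout moves the cell most of the way toward the healthy state; a score near 0 (or negative) suggests no therapeutic shift.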

Table 2: Performance of Open-Loop ISP vs. Differential Expression (DE) for T-cell Activation [9]

Method | Positive Predictive Value (PPV) | Negative Predictive Value (NPV) | Sensitivity | Specificity
Open-Loop ISP | 3% | 98% | 48% | 60%
Differential Expression (DE) | 3% | 78% | 40% | 50%
ISP & DE Overlap | 7% | - | - | -

Protocol 2: Closed-Loop In Silico Perturbation

This advanced protocol iteratively incorporates experimental data to significantly enhance prediction accuracy, creating a "virtuous cycle" of model improvement [9].

  • Initial Model Fine-Tuning: Perform the same initial fine-tuning step as in the Open-Loop protocol.
  • Integration of Perturbation Data:
    • Input: scRNA-seq data from a Perturb-seq (or similar) experiment, where cells have been experimentally perturbed. The data is labeled only with the cell's resulting state (e.g., activated/resting), not the perturbed gene's identity [9].
    • Process: The model is further fine-tuned on this combined dataset (original state data + perturbation data). This teaches the model how real perturbations manifest in transcriptomic space.
  • Closed-Loop Prediction and Validation:
    • Process: ISP is performed with the newly fine-tuned model on a set of candidate genes.
    • Key Advantage: This method dramatically improves prediction quality. In a T-cell activation study, it increased the Positive Predictive Value (PPV) from 3% to 9% and boosted sensitivity to 76% and specificity to 81% [9].
    • Iteration: The top predictions can be validated experimentally, and the results can be fed back into the model, further refining its accuracy in subsequent cycles.
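
The control flow of the closed loop can be sketched with stub functions standing in for fine-tuning, ISP scoring, and the wet-lab assay. Everything here is a placeholder except the loop structure itself; no real model or assay is implied.

```python
# Skeleton of the closed-loop cycle: fine-tune, rank candidates by ISP
# score, validate the top hits, and feed the results back in.

def run_closed_loop(model, candidates, assay, n_cycles=2, top_k=2):
    perturbation_data = []
    for _ in range(n_cycles):
        model = model.finetune(perturbation_data)          # fine-tune on accumulated data
        scored = sorted(candidates, key=model.isp_score, reverse=True)
        hits = scored[:top_k]                              # rank candidates by ISP
        perturbation_data.extend(assay(g) for g in hits)   # validate and feed back
    return hits

class StubModel:
    """Placeholder for a fine-tunable scFM."""
    def finetune(self, data):
        return StubModel()  # a real model would update its weights here
    def isp_score(self, gene):
        return len(gene)    # placeholder scoring rule, not a real ISP metric

top = run_closed_loop(StubModel(), ["MTOR", "PRKCA", "TP53"],
                      assay=lambda g: (g, "activated"))
print(top)
```
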

[Figure: The closed-loop workflow. A pre-trained scFM is fine-tuned on state data (e.g., healthy vs. diseased), then fine-tuned on experimental perturbation data; in silico perturbation (ISP) is performed and top predictions are validated experimentally, with results fed back into the perturbation fine-tuning step.]

Figure 2: The Closed-Loop Framework for Iterative Model Improvement. This workflow demonstrates how integrating experimental perturbation data creates a feedback loop that enhances the scFM's predictive accuracy.

Performance Benchmarking and Current Limitations

Despite their promise, critical benchmarking studies reveal that the performance of scFMs, particularly for perturbation prediction, must be rigorously evaluated against simpler baselines.

Benchmarking Against Simple Baselines

A 2025 benchmark study compared five scFMs and two other deep learning models against simple linear models for predicting transcriptome changes after single or double genetic perturbations [11]. The findings were sobering:

  • Double Perturbation Prediction: For predicting effects of double-gene perturbations, no deep learning model outperformed a simple additive baseline (sum of individual logarithmic fold changes) [11].
  • Unseen Perturbation Prediction: In predicting effects of entirely unseen perturbations, foundation models like scGPT and Geneformer were unable to consistently outperform a simple baseline that always predicts the mean expression from the training set [11].
  • Utility of Pretrained Embeddings: While the end-to-end models struggled, the gene embeddings extracted from scGPT and scFoundation could be used in a simple linear model, which then performed competitively. This suggests the pretraining does capture some useful biological information, but the models' complex decoders may not leverage it optimally for this task [11].
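
The additive baseline is simple enough to state exactly: for a double perturbation, predict each gene's log fold change as the sum of the two single-perturbation log fold changes. The values below are invented for illustration.

```python
# The additive baseline that deep models struggled to beat for
# double-perturbation prediction: per-gene sum of single-perturbation
# log fold changes.

def additive_baseline(lfc_a, lfc_b):
    return {g: lfc_a[g] + lfc_b[g] for g in lfc_a}

lfc_kras = {"MYC": -1.2, "CDKN1A": 0.8}   # illustrative single-perturbation LFCs
lfc_tp53 = {"MYC": -0.3, "CDKN1A": 1.5}
pred_double = additive_baseline(lfc_kras, lfc_tp53)
print(pred_double)  # {'MYC': -1.5, 'CDKN1A': 2.3}
```

By construction this baseline cannot capture epistasis (non-additive genetic interactions), which is exactly why its competitiveness is such a sobering result.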

Table 3: Key Findings from Benchmarking scFMs on Perturbation Prediction [11]

Benchmark Scenario | Top Performing Model(s) | Key Implication
Double Gene Perturbation | Additive Linear Model (Baseline) | Current scFMs fail to capture non-additive genetic interactions better than a simple heuristic.
Unseen Single Gene Perturbation | Mean Prediction (Baseline); Linear Model with Perturbation Data | Pretraining on single-cell atlases offers less predictive power than pretraining on perturbation data itself.
Use of Model Embeddings | Linear Model using scGPT/scFoundation Gene Embeddings | Pretrained embeddings contain valuable biological knowledge, but may be better utilized by simpler models.

Practical Application: A Case Study in RUNX1-FPD

The closed-loop framework has shown tangible success in a real disease context. Researchers applied it to RUNX1-Familial Platelet Disorder (RUNX1-FPD), a rare blood disorder [9]. After fine-tuning Geneformer on HSCs with RUNX1 loss-of-function, closed-loop ISP identified 14 high-confidence gene targets whose perturbation could shift diseased cells toward a healthy state. This led to the identification of several therapeutic pathways, including mTOR and protein kinase C, demonstrating the potential of scFMs to accelerate drug discovery for rare diseases where samples are scarce [9].

The Scientist's Toolkit: Essential Research Reagents and Materials

Table 4: Key Research Reagent Solutions for scFM-Based Perturbation Studies

Item / Reagent | Function in scFM Research
Public Cell Atlas Data (e.g., CZ CELLxGENE) [8] | Provides the large-scale, diverse single-cell datasets required for pretraining scFMs. Serves as a source of healthy/diseased reference data.
Perturb-seq / CRISPR Screens [9] | Generates the essential ground-truth dataset of single-cell transcriptomes following experimental genetic perturbations. Critical for closed-loop fine-tuning.
High-Quality scRNA-seq Datasets | Used for the initial fine-tuning of scFMs to learn the transcriptional signatures of specific biological states (e.g., T-cell activation, disease model vs. control).
Engineered Cell Models [9] | Provides a controlled system for modeling genetic diseases (e.g., RUNX1-FPD) and validating in silico perturbation predictions.
GPU Computing Clusters | Provides the necessary computational power for the fine-tuning and inference of large transformer models, which is computationally intensive [8].

Single-cell foundation models (scFMs) represent a revolutionary convergence of deep learning and computational biology, with transformer architectures at their core. These models fundamentally reinterpret cellular biology by treating individual cells as "sentences" and genes or genomic features as "words" or "tokens" [12]. This conceptual framework allows researchers to analyze cellular heterogeneity and complex regulatory networks using the same architectural principles that have revolutionized natural language processing. The adaptation of transformer models to single-cell genomics addresses a critical need for unified frameworks capable of integrating and comprehensively analyzing rapidly expanding biological data repositories, which now encompass tens of millions of single-cell omics datasets spanning diverse tissues, species, and biological conditions [12] [13].

The core innovation lies in applying self-supervised learning to vast single-cell datasets, enabling models to capture fundamental biological principles that generalize across diverse downstream tasks. Unlike traditional single-task models, scFMs leverage transformer architectures to incorporate diverse omics data—including single-cell RNA sequencing (scRNA-seq), single-cell ATAC sequencing (scATAC-seq), multiome sequencing, spatial transcriptomics, and proteomics—extracting latent patterns at both cell and gene/feature levels [12]. This architectural foundation has enabled breakthroughs in cross-species cell annotation, in silico perturbation modeling, and gene regulatory network inference, representing a paradigm shift toward scalable, generalizable frameworks capable of unifying diverse biological contexts [13].

Architectural Foundations: From Attention to Biological Insight

Core Attention Mechanism Components

The transformative capability of scFMs originates from the attention mechanism, which enables models to dynamically weight the importance of different genes when making predictions about cellular states. The mechanism operates through three fundamental components:

  • Queries (Q): Vectors that represent the current focus or question being asked about the cellular state, analogous to asking "What here is relevant?" in a biological context [14].
  • Keys (K): Vectors that contain information about what each gene token can provide, essentially answering "Here's what I have!" from the perspective of individual genomic features [14].
  • Values (V): Vectors that contain the actual biological information used to construct the model's output representations [14].

These components are derived from the same input gene embeddings through learned linear transformations, allowing the model to project genomic data into spaces where biological relationships become computationally apparent [14]. The attention weights are calculated through scaled dot-product operations, followed by softmax normalization to convert similarity scores into probabilities that highlight the most important genetic relationships for any given cellular context [14].
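
The computation described above can be written out directly as scaled dot-product attention over a toy set of token vectors. The Q, K, and V values are hand-picked so the first key matches the query; a real model would produce them via learned linear projections of gene embeddings.

```python
# Minimal scaled dot-product attention: softmax(QK^T / sqrt(d)) @ V,
# in pure Python for a handful of 2-dimensional token vectors.
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # similarity scores -> probabilities
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                      # one query vector
K = [[1.0, 0.0], [0.0, 1.0]]          # two key vectors
V = [[10.0, 0.0], [0.0, 10.0]]        # two value vectors
print(attention(Q, K, V))  # output weighted toward the first value vector
```
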

Multi-Head Attention and Biological Specialization

Advanced scFMs employ multi-head attention, which operates like a team of biological experts analyzing the same cellular data from different perspectives. Each attention "head" independently focuses on distinct biological relationships—such as regulatory dynamics, functional pathways, or co-expression patterns—with their outputs merged to form rich, nuanced cellular representations [14]. This architectural approach enables models to capture diverse relationship types simultaneously, making them robust to biological variability and complexity [14].

Unlike linguistic data, single-cell data has no natural sequential ordering, so transformers require specialized adaptation through deterministic gene-ranking strategies. Common approaches include ranking genes by expression level within each cell or partitioning genes into expression-value bins, creating artificial "sentences" from fundamentally non-sequential data [12]. Positional encoding schemes are then adapted to represent the relative order or rank of each gene in the cell, preserving critical information about expression hierarchies [12].
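The expression-ranking strategy can be sketched as a short function; the gene names and expression values below are purely illustrative.

```python
# Hedged sketch of expression-based rank tokenization: genes are sorted by
# expression within a cell and the top-k gene IDs form the input "sentence".
def tokenize_cell(expression, top_k=4):
    """expression: dict mapping gene symbol -> normalized expression value."""
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    # keep only expressed genes, in descending-expression order
    return [gene for gene, value in ranked[:top_k] if value > 0]

cell = {"CD3E": 5.1, "GAPDH": 9.8, "IL2": 0.0, "ACTB": 7.2, "FOXP3": 1.3}
tokens = tokenize_cell(cell, top_k=4)
# highest-expressed genes come first; unexpressed genes are dropped
```

A positional encoding would then be indexed by each token's rank in this list rather than by any genomic coordinate.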

Model Architecture Variations

scFMs demonstrate significant architectural diversity while maintaining core transformer principles:

Table: Architectural Variations in Single-Cell Foundation Models

Model Type | Architecture | Tokenization Approach | Gene Ranking Method | Notable Examples
Encoder-based | Bidirectional transformer | Gene-level tokens | Expression magnitude ranking | scBERT, Geneformer
Decoder-based | Autoregressive transformer | Natural language tokenization | Rank-based sequencing | cell2sentence (C2S)
Hybrid | Transformer with specialized components | Combined gene and metadata tokens | Multi-factor ranking | scGPT, scPlantFormer

Most scFMs use variants of the transformer architecture configured with different attention head counts, layer depths, and hidden dimension sizes [12]. Encoder-based models like scBERT employ bidirectional attention to capture genomic context from both "directions" simultaneously, while decoder-based models like cell2sentence leverage autoregressive approaches that generate gene sequences sequentially [15]. Emerging hybrid architectures incorporate specialized components for spatial relationships, phylogenetic constraints, or multimodal integration [13].

Application Notes: Experimental Protocols for scFM Implementation

Protocol 1: In Silico Perturbation Prediction with Closed-Loop Fine-Tuning

Purpose: To predict transcriptional responses to genetic perturbations and iteratively improve prediction accuracy through experimental feedback.

Background: In silico perturbation (ISP) modeling enables researchers to simulate how cells respond to genetic manipulations without costly wet-lab experiments. The "closed-loop" approach significantly enhances prediction accuracy by incorporating experimental perturbation data during model fine-tuning [9].

Materials:

  • Pre-trained scFM (e.g., Geneformer-30M-12L)
  • Single-cell RNA sequencing data from resting and activated cell states
  • Perturb-seq data with genetic perturbation labels
  • Computational environment with GPU acceleration

Procedure:

  • Baseline Fine-tuning: Fine-tune the pre-trained scFM to classify cell states using scRNA-seq data from resting and activated conditions. Validate classification accuracy on hold-out test sets (>99% accuracy achievable) [9].
  • Open-loop ISP: Perform in silico perturbation across the gene set, simulating both gene knockout and overexpression. For Geneformer, this involves computationally masking target genes and predicting expression changes [9].
  • Experimental Validation: Validate open-loop predictions against orthogonal measurement modalities (e.g., flow cytometry for T-cell activation markers) to establish baseline performance [9].
  • Closed-loop Fine-tuning: Incorporate Perturb-seq data alongside original scRNA-seq data during additional fine-tuning cycles. Critically, the Perturb-seq data should be labeled with activation status but not with specific gene perturbations to prevent data leakage [9].
  • Closed-loop ISP: Perform perturbation predictions using the refined model, excluding genes used in perturbation training to avoid circularity [9].
  • Performance Assessment: Evaluate positive predictive value, negative predictive value, sensitivity, and specificity against ground truth measurements. The closed-loop approach has demonstrated three-fold improvement in positive predictive value (from 3% to 9%) with concurrent enhancements in other metrics [9].
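The performance assessment in the final step reduces to standard confusion-matrix arithmetic. The counts below are illustrative, chosen only so that PPV comes out at the 9% reported for the closed-loop approach; they are not the study's actual data.

```python
# Sketch of the performance-assessment step: PPV, NPV, sensitivity, and
# specificity from predicted vs. ground-truth perturbation hits.
def classification_metrics(tp, fp, tn, fn):
    return {
        "PPV": tp / (tp + fp),           # positive predictive value
        "NPV": tn / (tn + fn),           # negative predictive value
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

# Illustrative counts: 9 / (9 + 91) reproduces the 9% closed-loop PPV.
m = classification_metrics(tp=9, fp=91, tn=810, fn=3)
```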

Troubleshooting:

  • If performance improvements plateau, incrementally add perturbation examples (10-20 examples often sufficient for substantial improvement) [9].
  • For batch effect concerns, incorporate batch information as special tokens during fine-tuning [12].
  • If model fails to converge, verify gene tokenization strategy matches pre-training approach [12].

Protocol 2: Mechanistic Interpretability via Transcoder-based Circuit Analysis

Purpose: To extract biologically interpretable decision-making circuits from scFMs, connecting model internal mechanisms to known biological pathways.

Background: A significant challenge in scFMs is the "black box" nature of their predictions. Transcoder-based circuit analysis resolves the polysemanticity problem—where individual model components encode multiple biological concepts simultaneously—by decomposing transformer operations into interpretable components [15].

Materials:

  • Trained scFM (e.g., cell2sentence model)
  • Target dataset for analysis (e.g., Heart Cell Atlas v2)
  • Transcoder implementation adapted for scFMs
  • Computational resources for feature attribution analysis

Procedure:

  • Transcoder Training: Train transcoders on each MLP layer of the target scFM using a biologically relevant dataset (90/10 train/validation split recommended). Use a maximum learning rate of 1×10⁻⁴ and sparsity regularization to encourage interpretable features [15].
  • Feature Activation Analysis: For biological questions of interest, identify transcoder features with high activation levels. These represent specialized biological functions learned by the model [15].
  • Circuit Tracing: Calculate attribution scores between transcoder feature pairs using the formula: z^(l,i)(x) × (fdec^(l,i) · fenc^(l',j)), where z represents input-dependent activation and the dot product represents input-independent connections [15].
  • Attention Head Integration: Track information flow across different genomic tokens through attention head OV matrices, identifying which gene tokens contribute to specific biological features [15].
  • Computational Subgraph Extraction: Iteratively apply attribution calculations to identify primary computational paths that activate specific biological features, integrating these paths into sparse computational subgraphs representing the model's internal decision-making process [15].
  • Biological Validation: Establish correspondence between extracted circuits and known biological mechanisms through literature validation and experimental data comparison [15].

Troubleshooting:

  • If transcoder features remain polysemantic, increase sparsity regularization strength [15].
  • For weak circuit signals, focus on high-variance biological contexts where model predictions are most confident.
  • If biological interpretation proves difficult, integrate gene ontology databases or known pathway information as prior knowledge [15].
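The attribution score from the circuit-tracing step above can be computed directly from its formula. The vectors and the activation value below are illustrative, not taken from any trained transcoder.

```python
# Sketch of the pairwise attribution score z(x) * (f_dec · f_enc): z is the
# input-dependent activation of an upstream transcoder feature, and the dot
# product is the input-independent connection between its decoder vector and
# a downstream feature's encoder vector.
import numpy as np

def attribution_score(z_upstream, f_dec_upstream, f_enc_downstream):
    return z_upstream * float(np.dot(f_dec_upstream, f_enc_downstream))

f_dec = np.array([1.0, 0.0, 2.0])    # illustrative decoder vector
f_enc = np.array([0.5, 1.0, 0.25])   # illustrative downstream encoder vector
score = attribution_score(z_upstream=2.0,
                          f_dec_upstream=f_dec,
                          f_enc_downstream=f_enc)
# 2.0 * (0.5 + 0.0 + 0.5) = 2.0
```

Iterating this calculation over feature pairs, and pruning low-magnitude scores, is what produces the sparse computational subgraphs described in the procedure.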

Benchmarking and Performance Evaluation

Quantitative Performance Assessment

Recent systematic benchmarking reveals critical insights into scFM capabilities and limitations, particularly for perturbation prediction tasks:

Table: Benchmarking Results for Perturbation Effect Prediction

Model/Approach | Double Perturbation Prediction Error (L2 Distance) | Single Perturbation Prediction | Genetic Interaction Detection | Computational Efficiency
Simple Additive Baseline | Reference performance | Varies by dataset | Not applicable | Most efficient
No Change Baseline | Higher than additive | Outperformed by linear models | Limited to buffering interactions | Most efficient
scGPT | Higher than baselines | Comparable to linear models | Poor (mostly buffering) | Moderate
Geneformer | Higher than baselines | Below linear models | Poor | Moderate
scBERT | Highest among benchmarks | Below linear models | Poor | Less efficient
Linear Model with Pretrained Embeddings | N/A | Best performance | Varies | Efficient

Notably, current scFMs generally do not outperform deliberately simple baselines for perturbation effect prediction, particularly in zero-shot settings where models must generalize without task-specific fine-tuning [11] [16] [17]. The additive baseline model, which simply sums individual logarithmic fold changes for double perturbations, consistently outperforms or matches complex foundation models across multiple benchmarks [11]. Similarly, simple linear models using pretrained perturbation embeddings outperform foundation models for predicting effects of unseen single perturbations [11].
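The additive baseline is simple enough to state in a few lines: the predicted log fold change for a double perturbation is the sum of the two single-perturbation profiles, scored against the observed profile by L2 distance. The values below are illustrative.

```python
# Sketch of the additive baseline for double-perturbation prediction.
import numpy as np

def additive_prediction(lfc_a, lfc_b):
    """Per-gene log fold changes of single perturbations A and B."""
    return lfc_a + lfc_b

def l2_error(predicted, observed):
    return float(np.linalg.norm(predicted - observed))

lfc_a = np.array([1.0, -0.5, 0.0])           # illustrative single-KO profile A
lfc_b = np.array([0.2, 0.3, -1.0])           # illustrative single-KO profile B
observed_double = np.array([1.1, -0.1, -1.2])
err = l2_error(additive_prediction(lfc_a, lfc_b), observed_double)
```

That this near-trivial computation matches or beats foundation models on double-perturbation benchmarks is precisely why it is a mandatory point of comparison.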

Critical Considerations for Model Selection

Performance evaluations across multiple domains reveal distinct model strengths and trade-offs:

  • Embedding Quality: scGPT consistently generates the highest quality cell embeddings in zero-shot settings, achieving superior separation of cell types in visualization landscapes [5].
  • Batch Effect Correction: scGPT demonstrates superior performance in removing technical batch effects while preserving biological distinctions, though all models struggle with cross-technology integration [5].
  • Input Length Sensitivity: scGPT embedding quality improves with longer gene input sequences, while scBERT performance typically degrades with increasing sequence length [5].
  • Computational Efficiency: scGPT and Geneformer show superior memory and computational efficiency compared to scBERT and scFoundation, making them more practical for large-scale analyses [5].

These findings highlight that model selection must be guided by specific application requirements rather than assuming general superiority of foundation models over simpler approaches.

Research Reagent Solutions: Essential Computational Tools

Table: Essential Research Reagents and Computational Tools for scFM Research

Resource Name | Type | Primary Function | Application Context
BioLLM | Standardized Framework | Unified interface for diverse scFMs | Model benchmarking and deployment [5]
PertEval-scFM | Benchmarking Framework | Standardized evaluation of perturbation predictions | Model performance validation [16] [17]
CZ CELLxGENE | Data Repository | Unified access to annotated single-cell datasets | Pretraining data sourcing [12]
DISCO | Data Atlas | Aggregated single-cell data for federated analysis | Multimodal data integration [13]
cell2sentence (C2S) | Pre-trained Model | Decoder-based scFM with biological literature training | Interpretability studies [15]
Geneformer | Pre-trained Model | Encoder-based scFM with focus on gene relationships | Gene-level tasks [5]
scGPT | Pre-trained Model | Large-scale transformer supporting multi-omic tasks | General-purpose applications [5]

Visualizing Architectural Components and Workflows

Core scFM Architecture with Attention Mechanisms

(Diagram: cells and genes are tokenized into gene embeddings; combined with positional encodings, these are projected into query, key, and value vectors; attention weights computed from queries and keys reweight the values into context vectors, which yield cell- and gene-level embeddings that feed the model's predictions.)

Closed-Loop In Silico Perturbation Workflow

While transformer-based scFMs represent a significant architectural advancement in computational biology, substantial challenges remain. Current models face limitations in perturbation effect prediction, often failing to outperform simple linear baselines [11] [16]. The interpretability gap persists despite advances in mechanistic interpretability techniques [15], and batch effects continue to complicate cross-study integration [5].

Future developments will likely focus on specialized architectures for perturbation modeling, improved multimodal integration strategies, and more biologically-grounded benchmarking frameworks. The emergence of closed-loop approaches that iteratively incorporate experimental feedback demonstrates promising pathways for enhancing predictive accuracy [9]. As the field matures, standardized evaluation frameworks like BioLLM [5] and PertEval-scFM [16] [17] will be crucial for directing methodological progress toward biologically meaningful improvements rather than purely algorithmic advancements.

The integration of transformer architectures with single-cell genomics has unquestionably transformed the scale and scope of computational biological analysis. Through continued architectural innovation and rigorous biological validation, scFMs hold the potential to evolve from powerful pattern recognition tools into genuinely predictive in silico models of cellular behavior.

The development of robust single-cell foundation models (scFMs) for in silico perturbation modeling is fundamentally constrained by the scale, diversity, and quality of the data used for their pretraining. A foundation model is a large-scale deep learning model pretrained on vast datasets via self-supervised learning, enabling it to be adapted to a wide range of downstream tasks [1]. The premise of scFMs is that by exposing a model to millions of cells encompassing diverse tissues, species, and conditions, it can learn fundamental, generalizable principles of cellular identity and function [1]. For perturbation modeling, this extensive pretraining is critical, as it allows the model to internalize a representation of the "normal" cellular state space, against which the effects of genetic or chemical perturbations can be accurately predicted. The success of models like scGPT, pretrained on over 33 million cells, demonstrates the power of this approach [18]. This protocol details the data sources and methodologies for constructing a comprehensive pretraining corpus tailored for scFMs focused on perturbation biology.

Compendium of Public Data Repositories

Assembling a pretraining dataset requires leveraging multiple public repositories that host and standardize single-cell data. The table below summarizes the key resources, their primary content, and quantitative metrics relevant for scFM development.

Table 1: Key Public Repositories for Single-Cell and Perturbation Data

Repository Name | Primary Content & Specialization | Reported Scale (Cells / Datasets) | Notable Features for Perturbation Modeling
CZ CELLxGENE [1] | Curated single-cell census data; multi-species, multi-tissue | Over 100 million cells [1] | Unified access to annotated datasets; standardized for analysis
Human Cell Atlas (HCA) [19] | Multi-omic, community-generated open data | 70.3 million cells; 523 projects; 11.2k donors [19] | Aims for a comprehensive reference map of all human cells
PerturbSeq.db [20] | Curated single-cell perturbation datasets | 189 datasets (165 scRNA-seq, 24 scATAC-seq) from 77 studies [20] | Dedicated to genetic (CRISPR) and chemical perturbation data
Expression Atlas [21] | Bulk and single-cell gene expression under different conditions | Information missing | Provides differential expression data across diverse biological conditions
DISCO [18] | Single-cell omics data browser and repository | Aggregates over 100 million cells [18] | Supports federated analysis across multiple data sources
Gene Expression Omnibus (GEO) / SRA [1] | Primary archive for high-throughput sequencing data | Thousands of single-cell studies [1] | Raw, primary data; requires significant processing and curation

Protocol: Constructing a Pretraining Corpus for Perturbation scFMs

This protocol outlines a systematic procedure for building a large-scale, high-quality pretraining dataset from the repositories listed above, with a specific emphasis on enabling robust in silico perturbation modeling.

Stage 1: Data Discovery and Selection

Objective: To identify and select relevant datasets that maximize biological and technical diversity.

Materials: Access to the internet, computational resources for metadata handling.

Procedure:

  • Prioritize Perturbation-Centric Repositories: Begin the data collection by querying specialized perturbation databases like PerturbSeq.db. This repository is pre-curated and provides a direct source of single-cell data from CRISPR-based (KO, CRISPRi, CRISPRa) and small-molecule compound screens [20].
  • Expand to General Cell Atlases: Integrate data from large-scale cell atlases, primarily CZ CELLxGENE and the Human Cell Atlas [1] [19]. These resources provide the essential "baseline" representation of cellular states across tissues, donors, and species, which is the foundation upon which perturbation responses are modeled.
  • Define Inclusion Criteria: Establish and adhere to the following criteria during dataset selection:
    • Species: Focus on Homo sapiens and Mus musculus, as they are the best-represented species in public repositories and are primary models for drug development [20].
    • Cell and Tissue Diversity: Actively select datasets from a wide range of primary cells, cell lines, and tissues (e.g., immune cells, neural, epithelial) to ensure the model learns a generalizable representation of cellular systems [20].
    • Technology and Modality: Initially focus on single-cell RNA sequencing (scRNA-seq) data due to its abundance. Subsequently, incorporate single-cell ATAC-seq (scATAC-seq) data to empower the model to reason about gene regulatory networks underlying perturbation responses [20].
    • Metadata Completeness: Give preference to datasets with comprehensive sample annotations, including donor information, tissue origin, and detailed experimental protocols.

Stage 2: Data Retrieval and Quality Control

Objective: To download selected data and perform rigorous quality control to ensure dataset integrity.

Materials: High-performance computing cluster, sufficient data storage, tools like wget or aws s3 for data transfer, and single-cell analysis toolkits (e.g., Scanpy in Python).

Procedure:

  • Data Download: Download datasets in standardized formats such as H5AD (.h5ad files) where available, as this is the common format for CZ CELLxGENE and many other resources [22].
  • Initial Quality Filtering: For each dataset, apply standard single-cell QC filters using a consistent pipeline. This typically includes:
    • Removing cells with an unusually low or high number of detected genes (potential empty droplets or doublets).
    • Filtering cells with a high percentage of mitochondrial reads (indicative of apoptotic or low-quality cells).
    • Filtering out genes that are detected in only a very small number of cells.
  • Batch Effect Audit: Visually inspect the data using Uniform Manifold Approximation and Projection (UMAP) plots colored by dataset of origin, sequencing platform, and other technical variables. This qualitative assessment is crucial for identifying strong technical batch effects that will require specialized handling during integration [1].
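The initial quality filters in this stage can be sketched on a toy count matrix. This uses plain numpy for self-containment; in practice the same filters are typically applied with Scanpy's preprocessing functions, and the thresholds below are illustrative defaults, not prescriptions.

```python
# Sketch of standard single-cell QC filters on a toy cells x genes matrix.
import numpy as np

def qc_filter(counts, mito_mask, min_genes=2, min_cells=2, max_mito_frac=0.2):
    genes_per_cell = (counts > 0).sum(axis=1)
    mito_frac = counts[:, mito_mask].sum(axis=1) / counts.sum(axis=1)
    keep_cells = (genes_per_cell >= min_genes) & (mito_frac <= max_mito_frac)
    counts = counts[keep_cells]
    keep_genes = (counts > 0).sum(axis=0) >= min_cells  # detected in enough cells
    return counts[:, keep_genes], keep_cells, keep_genes

counts = np.array([[5, 0, 3, 1],
                   [0, 0, 1, 0],    # too few detected genes -> dropped
                   [2, 8, 0, 1],
                   [1, 1, 1, 9]])   # high mitochondrial fraction -> dropped
mito = np.array([False, False, False, True])  # last gene is mitochondrial
filtered, keep_cells, keep_genes = qc_filter(counts, mito)
```

Running the same logic per dataset with a shared configuration is what keeps the QC pipeline "consistent" across the corpus.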

Stage 3: Data Integration and Harmonization

Objective: To merge the individually curated datasets into a unified, analysis-ready corpus while mitigating technical noise.

Materials: Integrated development environment (e.g., RStudio, Jupyter Notebook), single-cell integration tools (e.g., scVI, Harmony, Scanorama).

Procedure:

  • Feature Space Unification: Intersect the gene features (vocabulary) across all datasets to create a common feature space for model input. This often involves retaining only highly variable genes that are robustly measured across multiple studies.
  • Apply Integration Algorithms: Utilize advanced batch integration methods, such as sysVI (a batch-aware conditional Variational Autoencoder) or similar tools, to align the datasets [18]. The goal is to create a shared latent space where cells cluster by biological identity rather than technical origin.
  • Corpus Splitting: Partition the fully integrated corpus into three distinct sets:
    • Pretraining Set (~90%): Used for the self-supervised training of the scFM.
    • Validation Set (~5%): Used for monitoring training progress and tuning hyperparameters.
    • Hold-out Test Set (~5%): A completely withheld set of data, ideally from unique studies or donors, used for the final evaluation of the model's generalization performance, especially on perturbation prediction tasks.
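The corpus split above can be sketched as a small helper that partitions by study rather than by cell, so that the hold-out set genuinely comes from unseen studies. The study IDs and the 20-study corpus are hypothetical.

```python
# Sketch of a 90/5/5 split grouped by study, so held-out data comes from
# studies never seen during pretraining.
import random

def split_by_study(study_ids, fractions=(0.9, 0.05, 0.05), seed=0):
    studies = sorted(set(study_ids))
    random.Random(seed).shuffle(studies)         # deterministic shuffle
    n = len(studies)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)              # remainder becomes the test set
    return (set(studies[:n_train]),
            set(studies[n_train:n_train + n_val]),
            set(studies[n_train + n_val:]))

studies = [f"study_{i}" for i in range(20)]      # hypothetical study IDs
train, val, test = split_by_study(studies)
# 18 / 1 / 1 studies under the default 90/5/5 fractions
```

Splitting at the study (or donor) level, rather than at the cell level, is what prevents information leakage between the pretraining and evaluation sets.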

The following diagram illustrates the complete workflow from data discovery to a finalized pretraining corpus.

(Diagram: define project scope; Stage 1, data discovery: query repositories (PerturbSeq.db, CELLxGENE, HCA) and apply inclusion criteria; Stage 2, retrieval and QC: download data in standardized formats, then run quality control and a batch effect audit; Stage 3, integration: unify the feature space, align datasets with integration tools, and split into train/validation/test sets to produce the final pretraining corpus.)

Figure 1: Workflow for building a pretraining corpus for perturbation scFMs.

The Scientist's Toolkit: Essential Research Reagents and Materials

The following table lists key computational tools, data resources, and platforms that constitute the essential "reagent solutions" for developing scFMs for perturbation modeling.

Table 2: Key Research Reagent Solutions for scFM Pretraining

Item Name | Type | Primary Function in Pretraining
PerturbSeq.db [20] | Database | A pre-curated repository of single-cell perturbation datasets, providing ready-made data for training and benchmarking perturbation models.
CZ CELLxGENE / HCA [1] [19] | Data Platform | Provides the foundational "baseline" cellular data at scale, essential for teaching the model normal cellular states.
scGPT / scPlantFormer [18] | Foundation Model | Examples of state-of-the-art scFMs whose architectures and pretraining protocols can be adopted or adapted for new models.
BioLLM [18] | Software Framework | A standardized framework for integrating and benchmarking different single-cell foundation models, enabling performance comparison.
sysVI [18] | Computational Tool | A batch integration tool that preserves biological variation while removing technical noise, critical for data harmonization.
FISHscale / FISHspace [22] | Analysis Pipeline | Software for processing and analyzing spatial transcriptomics data (e.g., EEL-FISH), allowing for the inclusion of spatial context.

Visualization of Data Quality Control Workflow

A critical, iterative step in the protocol is ensuring the quality of the incoming data. The diagram below details the quality control process applied to each dataset before integration.

Figure 2: Data quality control and batch effect audit workflow.

The construction of a high-quality, large-scale pretraining corpus is a foundational step in developing scFMs capable of accurate in silico perturbation modeling. By systematically leveraging public repositories—from specialized resources like PerturbSeq.db for perturbation data to expansive atlases like the HCA for cellular baselines—and adhering to a rigorous protocol of selection, quality control, and integration, researchers can build the robust datasets required to power the next generation of predictive models in computational biology and drug discovery.

Tokenization, the process of converting raw gene expression data into discrete, model-readable units or "tokens," is a foundational step in building single-cell foundation models (scFMs). For in silico perturbation modeling—where the goal is to computationally predict cellular responses to genetic or chemical perturbations—the choice of tokenization strategy directly impacts a model's ability to learn meaningful biological representations and generalize to unseen data. This document outlines prevalent tokenization strategies, provides quantitative comparisons, details experimental protocols for their implementation, and visualizes key workflows and pathways relevant to perturbation modeling.

In single-cell RNA sequencing (scRNA-seq) analysis, tokenization strategies define how the high-dimensional and non-sequential gene expression profile of a single cell is transformed into a structured sequence for transformer-based models [1]. The core challenge is that gene expression data lacks inherent sequence; the order of genes in a cell does not carry semantic meaning as words do in a sentence. scFMs address this by imposing a deterministic order or structure on the gene features.

Table 1: Common Tokenization Strategies for Single-Cell Foundation Models

Strategy Name | Core Principle | Typical Model Examples | Key Advantages | Key Limitations
Expression-Based Ranking | Genes are ordered by their expression value within each cell, and the top-k genes form the input sequence [1]. | scGPT, Geneformer [1] | Simple, deterministic; captures most active genes. | Order is arbitrary and may not reflect biological gene-gene relationships.
Expression Binning | Genes are partitioned into bins (e.g., high/medium/low expression) based on their expression values, and the bin membership determines the token [1]. | scBERT [1] | Reduces vocabulary size; can capture coarse-grained expression levels. | Loss of fine-grained, continuous expression information.
Direct Normalized Counts | Uses normalized count values (or their log-transform) directly as input features without complex sequencing [1]. | Some scFMs [1] | Preserves full, continuous expression information. | Model must learn to handle high dimensionality and sparsity directly.
Convolutional Tokenization | The entire gene expression vector is segmented into fixed-size windows, and 1D-convolution is applied to generate local feature tokens [23]. | scSFUT [23] | Eliminates need for gene selection; uses full gene vector; expands attention receptive field. | Computationally intensive; less interpretable at the single-gene level.

Quantitative Comparison of Strategy Performance

The performance of a tokenization strategy is intrinsically linked to the downstream task. For in silico perturbation (ISP) prediction, the "closed-loop" framework—which incorporates experimental perturbation data during model fine-tuning—has demonstrated significant improvements over "open-loop" approaches. The following table summarizes performance metrics from a benchmark study that utilized a Geneformer model, highlighting the impact of data integration on ISP accuracy [9].

Table 2: Performance of Open-Loop vs. Closed-Loop In Silico Perturbation Prediction in T-Cell Activation [9]

Prediction Method | Positive Predictive Value (PPV) | Negative Predictive Value (NPV) | Sensitivity | Specificity | AUROC
Differential Expression (DE) | 3% | 78% | 40% | 50% | Not Reported
Open-Loop ISP | 3% | 98% | 48% | 60% | 0.63
DE + Open-Loop ISP Overlap | 7% | Not Reported | Not Reported | Not Reported | Not Reported
Closed-Loop ISP | 9% | 99% | 76% | 81% | 0.86

A critical finding for practical implementation is that the performance of the closed-loop model improved dramatically with just 10 perturbation examples and approached saturation with approximately 20 examples, indicating that even modest experimental validation can substantially enhance predictive accuracy [9].

Detailed Protocols for Tokenization and In Silico Perturbation

Protocol 1: Expression-Based Ranking and Tokenization for scFM Fine-Tuning

This protocol details the steps for fine-tuning a pre-trained scFM, like Geneformer, using an expression-based ranking tokenization strategy for a specific in silico perturbation task [9] [1].

Materials and Reagents:

  • Hardware: Workstation with >= 32 CPUs, >= 64 GB RAM, and >= 64 GB free storage [24].
  • Software: Command-line interface (Bash), Python, and relevant machine learning libraries (PyTorch/TensorFlow). Pre-trained scFM (e.g., Geneformer).
  • Data: scRNA-seq dataset (FASTQ or count matrices) for the biological system of interest (e.g., T-cells, HSCs). For closed-loop learning, Perturb-seq data for the same system is required [9].

Method Details:

  • Data Preprocessing and Quality Control:

    • Quality Control: Filter cells with low gene counts and genes with low expression across cells. Perform log-normalization (e.g., log10(gexp + 1)) to stabilize variance and manage long-tailed distributions [25].
    • Data Integration (if multiple batches): Apply batch effect correction methods like ComBat if integrating data from multiple studies or platforms, though some scFMs report robustness to batch effects without specific correction [1] [26].
  • Tokenization and Input Sequencing:

    • For each cell, rank all genes by their normalized expression value.
    • Select the top 2,000 - 6,000 genes (model-dependent) to form the input sequence for that cell [1].
    • Convert each gene's identifier (e.g., Ensembl ID) and its expression value into a combined token embedding. The sequence of these tokens, in the order of their rank, represents the "sentence" for the cell.
  • Model Fine-Tuning:

    • Initialize the model with pre-trained weights from the scFM.
    • Fine-tune the model on the tokenized sequences from your target dataset. The learning objective is typically a classification task (e.g., activated vs. resting T-cells) or a regression task to predict a cellular state [9].
  • In Silico Perturbation Prediction:

    • To simulate a gene knockout, set the expression value of the target gene to zero in the input data and re-tokenize.
    • To simulate overexpression, artificially elevate the expression value of the target gene beyond its normal range and re-tokenize.
    • Feed the modified token sequence through the fine-tuned model and compare the output embedding or prediction to the unperturbed state. A significant shift indicates a predicted phenotypic change [9].
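The perturbation step above can be sketched as follows. The `embed` function here is a toy stand-in for the fine-tuned scFM, and the gene names and expression values are illustrative; the point is the pattern of modify, re-embed, and compare.

```python
# Sketch of the in silico perturbation step: zero out (knockout) or boost
# (overexpression) a gene, then compare the perturbed embedding to baseline.
def perturb(expression, gene, mode="knockout", factor=10.0):
    modified = dict(expression)                    # leave the original intact
    modified[gene] = 0.0 if mode == "knockout" else modified[gene] * factor
    return modified

def embed(expression):
    # Toy stand-in: the real fine-tuned scFM maps a token sequence to a
    # learned embedding vector.
    return [expression[g] for g in sorted(expression)]

def shift(expression, gene, mode="knockout"):
    base, pert = embed(expression), embed(perturb(expression, gene, mode))
    return sum((a - b) ** 2 for a, b in zip(base, pert)) ** 0.5

cell = {"CD3E": 5.0, "IL2RA": 2.0, "ACTB": 7.0}    # illustrative cell
delta = shift(cell, "IL2RA", mode="knockout")
# a large shift indicates a predicted phenotypic change
```

With a real model, `embed` would tokenize the modified profile exactly as in the fine-tuning pipeline, and the shift would be measured in the model's latent space.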

Protocol 2: Implementing a Closed-Loop ISP Framework

This protocol extends Protocol 1 by iteratively incorporating experimental data to refine the scFM, dramatically improving ISP accuracy [9].

Method Details:

  • Initial Model and Perturbation Screening:

    • Fine-tune a scFM on baseline scRNA-seq data (e.g., resting and activated T-cells) as in Protocol 1.
    • Perform open-loop ISP on a wide panel of genes to generate initial predictions of which perturbations shift the cell state.
  • Experimental Validation and Data Integration:

    • Select top candidate perturbations from the open-loop screen for experimental validation using a method like Perturb-seq.
    • Generate scRNA-seq data for cells subjected to the candidate genetic perturbations.
  • Closed-Loop Fine-Tuning:

    • Combine the original baseline scRNA-seq data with the new Perturb-seq data. The Perturb-seq data should be labeled with the resulting cell state (e.g., activated), but not with the identity of the perturbed gene during this step [9].
    • Re-fine-tune the pre-trained scFM on this combined dataset. This teaches the model the causal relationships between perturbations and outcomes.
    • The resulting closed-loop model is now ready for a new, more accurate round of ISP.

Visualization of Workflows and Pathways

The following diagrams illustrate the core closed-loop framework and a key signaling pathway identified through its application.

(Diagram: starting from a pre-trained scFM, fine-tune on baseline scRNA-seq data, perform open-loop in silico perturbation (ISP), select candidate genes, validate them experimentally (e.g., Perturb-seq), and incorporate the perturbation data into further fine-tuning; this iterative refinement loop yields a closed-loop scFM with high ISP accuracy.)

Closed-Loop ISP Workflow

(Diagram: RUNX1 loss-of-function dysregulates Protein Kinase C (PKC), mTOR signaling, Phosphoinositide 3-Kinase (PI3K), and the CD74-MIF axis, converging on impaired hematopoiesis and elevated myeloid neoplasm risk.)

RUNX1-FPD Signaling Pathways

The Scientist's Toolkit

Table 3: Essential Research Reagents and Computational Tools for scFM and ISP

Item Name | Function / Application | Specific Examples / Notes
Pre-trained scFMs | Provides a foundational model that can be fine-tuned for specific tasks, saving computational resources. | Geneformer, scGPT, scBERT [1].
Containerization Platform | Ensures computational reproducibility by encapsulating the entire software environment. | Docker [24].
Integrated Pipelines | Provides pre-defined workflows for processing raw sequencing data into analyzable formats. | RumBall (for RNA-seq), bioBakery Workflows (for metagenomics) [24] [27].
Data Preprocessing Tools | Performs quality control, normalization, and batch effect correction on raw count matrices. | Scanpy in Python [23].
Perturbation Screening Tech | Experimentally validates in silico predictions and generates data for closed-loop learning. | CRISPRi/CRISPRa screens, Perturb-seq [9].
Reference Datasets | Used for model pretraining and as benchmarks for fine-tuned models. | CZ CELLxGENE, Human Cell Atlas, TCGA, GTEx [1] [26].

Self-supervised learning (SSL) has emerged as a transformative approach for analyzing single-cell transcriptome data, enabling researchers to extract meaningful biological insights from vast amounts of unlabeled data. By learning representations without manual annotation, SSL methods have demonstrated exceptional capability in capturing complex cellular states and functions, forming the foundational bedrock for advanced in silico perturbation modeling with single-cell foundation models (scFMs). This paradigm shift allows computational biologists to predict cellular responses to genetic and therapeutic interventions, accelerating therapeutic discovery—particularly for rare diseases where experimental data is scarce.

The power of SSL lies in its ability to leverage the intrinsic structure of single-cell RNA sequencing (scRNA-seq) data through pretext tasks that require the model to learn meaningful representations without explicit supervision. These pre-trained models can then be fine-tuned for specific downstream applications with remarkable efficiency. Within the context of scFMs research, SSL provides the essential pre-training framework that enables accurate prediction of perturbation effects, cell-type annotation, and data integration across diverse biological contexts.

Key Findings and Performance Benchmarks

Comparative Performance of SSL Strategies

Extensive benchmarking across multiple single-cell genomics datasets reveals the nuanced effectiveness of different SSL approaches. The following table summarizes key quantitative findings from large-scale studies evaluating SSL methods on millions of single cells:

Table 1: Performance comparison of self-supervised learning methods on single-cell transcriptomes

SSL Method | Key Application | Performance Metric | Result | Reference
Masked Autoencoder (Random masking) | Cell-type prediction (PBMC) | Macro F1 score | 0.7466 ± 0.0057 | [28]
Supervised Baseline | Cell-type prediction (PBMC) | Macro F1 score | 0.7013 ± 0.0077 | [28]
Masked Autoencoder (GP masking) | Cell-type prediction (Tabula Sapiens) | Macro F1 score | 0.3085 ± 0.0040 | [28]
Supervised Baseline | Cell-type prediction (Tabula Sapiens) | Macro F1 score | 0.2722 ± 0.0123 | [28]
Closed-loop ISP Framework | Perturbation prediction (T-cell activation) | Positive Predictive Value | 9% (vs. 3% open-loop) | [9]
scPML | Cross-platform cell annotation | Accuracy | 0.87 (mean) | [29]
Geneformer | Cross-platform cell annotation | Accuracy | 0.72 (mean) | [29]

Critical Insights on SSL Effectiveness

Research indicates that SSL demonstrates particularly strong performance in specific biological scenarios:

  • Transfer learning applications: SSL pre-training on large auxiliary datasets (e.g., scTab with >20 million cells) significantly improves performance on smaller target datasets for cell-type prediction and gene-expression reconstruction [28]. Improvements are most pronounced for underrepresented cell types, as evidenced by stronger gains in macro F1 scores compared to micro F1 scores [28].

  • Architectural advantages: Masked autoencoders consistently outperform contrastive learning methods in single-cell genomics applications, contrary to trends observed in computer vision [28] [30]. This advantage is maintained across multiple masking strategies, including random masking and biologically-informed gene program masking.

  • Data efficiency: The "closed-loop" framework for perturbation modeling demonstrates that incorporating even small amounts of experimental data (10-20 perturbation examples) during fine-tuning can substantially improve prediction accuracy [9].

Experimental Protocols

SSL Pre-training Protocol for Single-Cell Transcriptomes

Data Preparation and Preprocessing
  • Data Collection: Assemble a large-scale single-cell transcriptomics dataset for pre-training. The CELLxGENE census scTab dataset, comprising over 20 million cells across diverse tissues and conditions, serves as an ideal starting point [28]. Include all 19,331 human protein-coding genes to maximize generalizability.

  • Quality Control: Apply standard scRNA-seq quality control metrics:

    • Remove cells with fewer than 200 detected genes
    • Exclude cells with high mitochondrial read percentage (>20%)
    • Filter out genes expressed in fewer than 10 cells
  • Normalization: Normalize gene expression values using standard scRNA-seq processing:

    • Apply library size normalization to obtain counts per 10,000 (CP10K)
    • Log-transform using log1p (log(1 + CP10K))
    • Scale features to zero mean and unit variance
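A minimal NumPy sketch of these QC and normalization steps (function name and thresholds follow the protocol above; with Scanpy, the same pipeline corresponds to `sc.pp.filter_cells`, `sc.pp.filter_genes`, `sc.pp.normalize_total(target_sum=1e4)`, `sc.pp.log1p`, and `sc.pp.scale`):

```python
import numpy as np

def preprocess(counts, mito_mask, min_genes=200, max_mito_frac=0.20, min_cells=10):
    """QC + normalization for a cells x genes count matrix."""
    counts = np.asarray(counts, dtype=float)
    # 1) Remove cells with fewer than `min_genes` detected genes.
    keep_cells = (counts > 0).sum(axis=1) >= min_genes
    # 2) Exclude cells with a high mitochondrial read fraction.
    mito_frac = counts[:, mito_mask].sum(axis=1) / np.maximum(counts.sum(axis=1), 1)
    keep_cells &= mito_frac <= max_mito_frac
    counts = counts[keep_cells]
    # 3) Filter out genes expressed in fewer than `min_cells` cells.
    keep_genes = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, keep_genes]
    # 4) Library-size normalize to counts per 10,000, then log1p.
    cp10k = counts / counts.sum(axis=1, keepdims=True) * 1e4
    logged = np.log1p(cp10k)
    # 5) Scale each gene to zero mean and unit variance.
    return (logged - logged.mean(axis=0)) / np.maximum(logged.std(axis=0), 1e-8)
```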
Model Architecture and Training
  • Network Architecture: Implement a fully connected autoencoder network with the following specifications [28]:

    • Input layer: 19,331 neurons (one per human protein-coding gene)
    • Bottleneck layer: 512 neurons (compressed representation)
    • Output layer: 19,331 neurons (reconstruction)
    • Activation functions: ReLU for hidden layers, linear/sigmoid for output
  • Pretext Task Implementation - Masked Autoencoding:

    • Apply random masking to 30% of input features (genes)
    • Alternative strategy: Implement gene program (GP) masking using biologically defined gene sets
    • Train the model to reconstruct masked features using mean squared error loss computed only on masked positions
  • Training Specifications:

    • Optimization: Adam optimizer with learning rate of 0.001
    • Batch size: 256 cells
    • Training epochs: 100-200 with early stopping
    • Hardware: High-performance GPU cluster (e.g., NVIDIA A100)
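The masked-autoencoding pretext task reduces to: hide a fraction of genes, reconstruct, and compute MSE only on the masked positions. A NumPy sketch of that loss; the linear `model_fn` here is just a placeholder for the 19,331-512-19,331 network described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_mse(model_fn, x, mask_frac=0.30, rng=rng):
    """Masked-autoencoding pretext loss: mask `mask_frac` of input genes,
    reconstruct with `model_fn`, score MSE only on masked positions."""
    mask = rng.random(x.shape) < mask_frac      # True = this gene is masked
    x_masked = np.where(mask, 0.0, x)           # zero out masked inputs
    x_hat = model_fn(x_masked)                  # encoder -> decoder forward pass
    return float(((x_hat[mask] - x[mask]) ** 2).mean())

# Toy check: a random linear map standing in for the trained autoencoder.
W = rng.normal(scale=0.01, size=(50, 50))
loss = masked_mse(lambda z: z @ W, rng.normal(size=(8, 50)))
print(loss >= 0.0)
```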

Protocol for In Silico Perturbation Modeling

Foundation Model Fine-tuning
  • Base Model Initialization: Start with a foundation model pre-trained on large-scale single-cell data (e.g., Geneformer) [9] [11].

  • Task-Specific Fine-tuning:

    • Prepare labeled dataset of perturbation responses (e.g., CRISPR screens)
    • Add a classification or regression head appropriate for the prediction task
    • Fine-tune with a lower learning rate (1e-5 to 1e-4) to adapt the pre-trained weights
  • Closed-Loop Framework Implementation [9]:

    • Incorporate experimental perturbation data during fine-tuning
    • Use as few as 10-20 perturbation examples to substantially improve prediction accuracy
    • Implement iterative refinement cycles where model predictions guide subsequent experiments
Perturbation Effect Prediction
  • In Silico Perturbation Simulation:

    • For gene knockout: Set target gene expression to zero in the input vector
    • For gene overexpression: Increase target gene expression by 2-3 standard deviations
    • Pass modified input through the fine-tuned model to predict transcriptomic response
  • Validation and Interpretation:

    • Compare predictions to held-out experimental data
    • Analyze predicted expression changes in pathway context
    • Prioritize candidate genes based on effect size and confidence metrics
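The two simulation rules above amount to a simple edit of the model's input vector before the forward pass. A NumPy sketch; `gene_sd`, the per-gene standard deviation estimated from training data, is an assumed input:

```python
import numpy as np

def perturb_in_silico(expr, gene_idx, mode="knockout", n_sd=2.0, gene_sd=None):
    """Build the perturbed input vector fed to the fine-tuned model:
    knockout -> set the gene's expression to zero;
    overexpression -> raise it by `n_sd` standard deviations (2-3 SD)."""
    perturbed = np.array(expr, dtype=float, copy=True)
    if mode == "knockout":
        perturbed[gene_idx] = 0.0
    elif mode == "overexpression":
        perturbed[gene_idx] += n_sd * gene_sd[gene_idx]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return perturbed

expr = np.array([1.0, 2.0, 3.0])
sd = np.array([0.5, 0.5, 0.5])
print(perturb_in_silico(expr, 1, "knockout"))                 # gene 1 zeroed
print(perturb_in_silico(expr, 1, "overexpression", 2.0, sd))  # 2.0 raised by 2 SD
```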

Visualizing SSL Frameworks and Workflows

Self-Supervised Learning Framework for Single-Cell Transcriptomics

[Diagram: Input single-cell transcriptomes undergo input masking and pass through an encoder to a latent representation; a decoder reconstructs gene expression (the self-supervised pre-training loop). The latent representation feeds the downstream tasks: cell type annotation, perturbation prediction, data integration, and cross-modality prediction.]

Closed-Loop In Silico Perturbation Framework

[Diagram: A pre-trained foundation model undergoes initial fine-tuning with experimental perturbation data, yielding a fine-tuned model that performs in silico perturbation (ISP); candidate predictions are experimentally validated, and model refinement feeds back into ISP.]

Table 2: Key research reagents and computational resources for SSL in single-cell transcriptomics

Resource | Type | Function/Application | Example/Reference
scTab Dataset | Data Resource | Large-scale reference dataset for SSL pre-training; contains >20 million cells | CELLxGENE census [28]
Masked Autoencoder | Algorithm | SSL method for learning representations through reconstruction of masked input features | [28] [30]
Gene Program Annotations | Biological Knowledge | Curated gene sets for biologically-informed masking strategies | Pathway databases [29]
Geneformer | Foundation Model | Pre-trained transformer model for single-cell transcriptomics | [9] [11]
Closed-Loop Framework | Methodology | Approach for incorporating experimental data to improve perturbation predictions | [9]
scPML | Software Tool | Pathway-based multi-view learning for cell type annotation | [29]
Perturb-seq Data | Experimental Data | Single-cell CRISPR screening data for perturbation model training | [9] [11]

Self-supervised learning represents a paradigm shift in the analysis of single-cell transcriptomes, providing a powerful framework for extracting biological insights from unlabeled data at scale. The protocols and applications outlined in this document demonstrate the tangible benefits of SSL in enhancing cell-type annotation, data integration, and—most critically—predicting cellular responses to perturbations. As the field progresses toward more sophisticated "virtual cell" models, SSL will continue to serve as the foundational element enabling accurate in silico experiments and accelerating therapeutic discovery, particularly for rare diseases where experimental data remains limited. The integration of SSL pre-training with closed-loop experimental validation creates a powerful cycle of discovery that promises to transform computational biology and drug development.

Implementing In Silico Perturbation Prediction: From Virtual Cells to Therapeutic Discovery

In silico perturbation (ISP) represents a transformative computational approach in cellular biology, enabling researchers to predict the effects of genetic manipulations—such as gene knockouts and overexpression—without conducting costly and time-intensive laboratory experiments. This methodology leverages single-cell Foundation Models (scFMs), which are large-scale deep learning models pre-trained on vast datasets comprising millions of single-cell transcriptomes [1]. These models learn fundamental principles of cellular biology and gene regulation, allowing them to be fine-tuned for specific tasks like predicting transcriptional changes following genetic perturbations [9] [31]. The core premise of ISP is the creation of "virtual cells" that can simulate cellular responses to diverse perturbations, thus accelerating biological discovery and therapeutic development, particularly for rare diseases where patient samples are scarce [9].

The workflow operates by treating individual cells as "sentences" and genes or genomic features as "words" or "tokens" [1]. Through sophisticated tokenization and embedding processes, scFMs can model the complex, high-dimensional relationships within gene regulatory networks. When a perturbation is simulated, the model predicts how the removal (knockout) or enhanced expression (overexpression) of specific genes alters the transcriptional state of the cell [9] [32]. This capability is invaluable for prioritizing gene targets for functional validation, understanding disease mechanisms, and identifying potential therapeutic interventions [9].

Workflow and Methodology

The ISP workflow involves a sequence of critical steps, from data preparation and model setup to the execution and validation of in silico experiments. The following diagram illustrates the logical flow and key decision points in a standard ISP pipeline.

[Diagram: Start ISP analysis → input scRNA-seq data (WT samples) → select and configure scFM → fine-tune model on target cell state → define perturbation (knockout/overexpression) → execute in silico perturbation → analyze predicted expression changes → experimental validation, which either refines the model or (optionally) incorporates results in a closed loop.]

Data Preparation and Tokenization

The initial phase involves curating high-quality single-cell RNA sequencing (scRNA-seq) data, which serves as the input for the scFM. The model requires a gene-by-cell count matrix from wild-type (WT) samples [32]. A critical challenge is that gene expression data lacks inherent sequential order, unlike words in a sentence. To address this, scFMs employ various tokenization strategies to structure the data for the model:

  • Gene Ranking by Expression: Genes within each cell are ranked by their expression levels, and the ordered list of top genes is treated as the input "sentence" [1].
  • Expression Binning: Genes are partitioned into bins based on their expression values, and these rankings determine their positions in the sequence [1].
  • Gene Identifier and Value Embedding: Each gene is typically represented as a token embedding that combines a gene identifier with its expression value in the given cell [1].

Additional special tokens may be incorporated to provide biological context, such as cell identity metadata, modality indicators for multi-omics data, or gene ontology information [1]. Positional encoding schemes are then applied to represent the relative order or rank of each gene in the cell.
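The gene-ranking strategy can be made concrete in a few lines; gene names and the sequence length here are illustrative, not any specific model's vocabulary:

```python
import numpy as np

def rank_tokenize(expr, gene_names, max_len=8):
    """Rank-based tokenization: order genes by descending expression and emit
    the top `max_len` gene identifiers as the cell's token 'sentence'
    (zero-count genes are dropped)."""
    order = np.argsort(-np.asarray(expr, dtype=float), kind="stable")
    tokens = [gene_names[i] for i in order if expr[i] > 0]
    return tokens[:max_len]

genes = ["CD3D", "GAPDH", "IL2", "ACTB"]
print(rank_tokenize([5.0, 20.0, 0.0, 12.0], genes, max_len=3))
# -> ['GAPDH', 'ACTB', 'CD3D']
```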

Model Selection and Configuration

Selecting an appropriate scFM is crucial for ISP success. Current models vary in their architectures, pretraining data, and specific capabilities. The table below summarizes key models and their applications in ISP.

Table 1: Single-Cell Foundation Models for In Silico Perturbation

Model Name | Architecture Type | Key ISP Features | Perturbation Types Supported | Notable Applications
Geneformer [9] [31] | Transformer-based Encoder | Predicts direction of cell state shift (e.g., toward activation or rest); can be used in open- or closed-loop modes. | Knockout, Overexpression | T-cell activation studies; RUNX1-familial platelet disorder target identification.
scGPT [11] [31] | GPT-like Decoder | Predicts post-perturbation transcriptomes; can be combined with a linear decoder for perturbation tasks. | Single/double gene knockout | Benchmarking studies on CRISPRa/i datasets.
scTenifoldKnk [32] | Tensor-based Workflow | Constructs Gene Regulatory Networks (GRNs) from WT data; virtually deletes a gene from the GRN to identify differentially regulated genes. | Virtual knockout | Systematic virtual KO analysis; recapitulation of real-animal KO findings.
Large Perturbation Model (LPM) [31] | PRC-disentangled Decoder | Integrates diverse perturbation data (genetic, chemical); disentangles Perturbation, Readout, and Context dimensions. | CRISPR, Chemical compounds | Predicting outcomes of unobserved experiments; mapping compound-CRISPR shared space.

Executing the In Silico Perturbation

The core of the ISP workflow involves applying the selected and configured model to simulate the genetic perturbation.

For Gene Knockout Simulation
  • GRN-Based Approach (scTenifoldKnk): The Wild-Type Gene Regulatory Network (WT scGRN) is constructed from the input data. The target gene is then "virtually deleted" by setting the entire row corresponding to that gene in the adjacency matrix of the WT scGRN to zero, creating a pseudo-KO scGRN. Manifold alignment is used to compare the WT and pseudo-KO scGRNs to identify differentially regulated (DR) genes [32].
  • Encoder-Based Approach (Geneformer): The model, fine-tuned to classify cell states, is used to predict the direction of cell state change (e.g., toward a resting or activated state in T-cells) upon in silico knockout of a specific gene. The magnitude of the predicted shift indicates the gene's importance in maintaining the cell state [9].
  • Decoder-Based Approach (scGPT, LPM): The model is tasked with predicting the complete post-perturbation transcriptome given the perturbation (e.g., "knockout of Gene X") and the cellular context as input [11] [31].
For Gene Overexpression Simulation

Simulating overexpression often uses similar underlying architectures as knockout simulations. The key difference lies in how the perturbation is represented to the model. Instead of removing a gene's influence, the model is instructed to predict the transcriptional consequences of elevated expression of the target gene. For example, in Geneformer, this involves inputting a command to overexpress the gene and analyzing the predicted shift in the cell's embedding within the state space [9].
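The GRN-based virtual knockout reduces to zeroing the target gene's row in the adjacency matrix and comparing the two networks. A NumPy sketch; the per-gene change in incoming edge weights used below is a simple proxy for the manifold-alignment comparison in the published scTenifoldKnk workflow:

```python
import numpy as np

def virtual_knockout(grn, gene_idx):
    """scTenifoldKnk-style virtual KO sketch: zero the target gene's row in
    the WT GRN adjacency matrix, then rank genes by how much their incoming
    regulation changed between WT and pseudo-KO networks."""
    ko = grn.copy()
    ko[gene_idx, :] = 0.0                 # delete the gene's regulatory output
    delta = np.abs(grn - ko).sum(axis=0)  # per-gene change in incoming edges
    return ko, np.argsort(-delta)         # most differentially regulated first

# Toy 4-gene network: gene 0 strongly regulates genes 1 and 2.
grn = np.array([[0.0, 0.9, 0.8, 0.0],
                [0.0, 0.0, 0.1, 0.0],
                [0.0, 0.2, 0.0, 0.1],
                [0.0, 0.0, 0.0, 0.0]])
ko, ranked = virtual_knockout(grn, 0)
print(ranked[:2])   # genes 1 and 2 are flagged as most affected
```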

The Closed-Loop Framework for Enhanced Accuracy

A significant advancement in ISP is the "closed-loop" framework, which iteratively improves model predictions by incorporating experimental data [9]. The process is as follows:

  • Initial (Open-loop) Prediction: The base scFM is used to perform an initial ISP screen.
  • Experimental Validation: A subset of the top predictions is selected for experimental testing (e.g., via Perturb-seq).
  • Model Fine-tuning: The scFM is fine-tuned using the scRNA-seq data from the experimental perturbation screen alongside the original training data.
  • Refined (Closed-loop) Prediction: The fine-tuned model performs a second round of ISP, yielding more accurate predictions.

This framework has been shown to increase the Positive Predictive Value (PPV) of ISP three-fold, from 3% to 9%, while also improving sensitivity and specificity. Notably, performance improvements can saturate with as few as 20 experimental perturbation examples incorporated during fine-tuning [9].

Performance Benchmarking and Validation

Rigorous benchmarking is essential to assess the predictive power and limitations of ISP methods. A critical finding from recent large-scale benchmarks is that the performance of complex deep learning models must be compared against deliberately simple baselines [11].

Key Performance Metrics

A comprehensive evaluation of ISP models requires multiple metrics to capture different aspects of performance, as summarized in the table below.

Table 2: Key Metrics for Evaluating In Silico Perturbation Predictions

Metric | Definition | Interpretation in ISP Context | Key Findings from Recent Studies
L2 Distance / R² | Measures the overall agreement between predicted and observed gene expression values. | Assesses general transcriptome-wide prediction accuracy. | High R² does not guarantee good performance in identifying biologically significant changes [33]; complex models do not consistently outperform simple additive baselines for double perturbation prediction [11].
Area Under the Precision-Recall Curve (AUPRC) | Evaluates the precision and recall of identifying Differentially Expressed (DE) genes. | Directly measures the ability to detect biologically relevant, perturbed genes. | Models with high R² can have low AUPRC, highlighting the metric's importance for biologically relevant assessment [33].
Positive Predictive Value (PPV) | The proportion of predicted positive effects that are true positives. | Indicates the reliability of a predicted hit (e.g., a gene that shifts cell state). | Open-loop ISP had a PPV of 3% for T-cell activation, which increased to 9% with closed-loop fine-tuning [9].
Sensitivity / Recall | The proportion of true positives correctly identified. | Measures the model's ability to find all relevant hits. | Improved from 48% (open-loop) to 76% (closed-loop) in T-cell activation studies [9].
Specificity | The proportion of true negatives correctly identified. | Measures the model's ability to rule out non-hits. | Improved from 60% (open-loop) to 81% (closed-loop) in T-cell activation studies [9].

Simple Baselines for Comparison

Benchmarking studies have established that simple models provide a crucial baseline for evaluation [11]:

  • The 'No Change' Baseline: Always predicts the same expression as in the control condition.
  • The 'Additive' Baseline: For a double perturbation, predicts the sum of the individual logarithmic fold changes (LFCs) from single perturbations.
  • The 'Mean' Baseline: Always predicts the mean expression across the training set perturbations.

A landmark study found that none of the five tested foundation models and two other deep learning models outperformed these simple baselines in predicting transcriptome changes after double perturbations [11]. This underscores the importance of critical benchmarking and the need for continued method development.
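The three baselines are deliberately trivial to implement, which is exactly the point of the benchmark. A NumPy sketch in log-expression space; all values are illustrative:

```python
import numpy as np

def no_change_baseline(ctrl):
    """Predict the control-condition expression, unchanged."""
    return ctrl

def additive_baseline(ctrl, lfc_a, lfc_b):
    """Double-perturbation prediction: control expression plus the sum of
    the two single-perturbation log fold changes (working in log space)."""
    return ctrl + lfc_a + lfc_b

def mean_baseline(train_perturbed):
    """Predict the mean expression across the training-set perturbations."""
    return np.mean(train_perturbed, axis=0)

ctrl  = np.array([1.0, 2.0, 0.5])
lfc_a = np.array([0.5, 0.0, -0.2])   # single perturbation A
lfc_b = np.array([0.1, -1.0, 0.0])   # single perturbation B
print(additive_baseline(ctrl, lfc_a, lfc_b))   # sums to [1.6, 1.0, 0.3]
```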

Application Notes and Experimental Protocols

Protocol 1: Target Identification for a Rare Disease (RUNX1-FPD)

This protocol details the application of the closed-loop ISP framework for target discovery in RUNX1-Familial Platelet Disorder (RUNX1-FPD), a rare hematologic disease [9].

  • Cell Model Generation: Engineer human Hematopoietic Stem Cells (HSCs) to harbor RUNX1 loss-of-function mutations, mimicking the patient condition. Use HSCs with a guide RNA targeting a safe harbor site (e.g., AAVS1) as controls.
  • scRNA-seq Data Generation: Perform single-cell RNA sequencing on both the RUNX1-mutant and control HSCs to generate transcriptomic profiles.
  • Model Fine-tuning: Fine-tune a Geneformer model to classify and distinguish between the RUNX1-mutant and control HSC states based on the scRNA-seq data.
  • Open-loop ISP Screening: Use the fine-tuned model to perform in silico knockouts of all genes in the genome. The goal is to identify genes whose virtual deletion shifts the RUNX1-mutant HSCs transcriptomically toward the control state.
  • Target Prioritization and Cross-Validation: Compare the ISP results with those from differential expression (DE) analysis. Prioritize genes that are significant hits in both ISP and DE analyses for further validation.
  • Closed-loop Refinement (Optional): If resources permit, experimentally perturb the top candidate genes (e.g., via CRISPRi) in the RUNX1-FPD HSC model and sequence. Use this data to fine-tune the Geneformer model in a closed-loop, then re-run the ISP to generate a refined list of high-confidence targets.
  • Experimental Validation: Test the top candidate genes using specific small-molecule inhibitors or CRISPR-based perturbation and assess functional rescue of the platelet disorder phenotype in vitro.

Protocol 2: Simulating Double Gene Perturbations

This protocol is designed for predicting genetic interactions and the effects of combinatorial gene perturbations [11] [32].

  • Data Preparation: Obtain a single-cell dataset (e.g., the Norman et al. dataset) that includes profiles for unperturbed cells, cells with single-gene perturbations, and cells with double-gene perturbations.
  • Data Partitioning: Split the double perturbation data into training and test sets (e.g., 62 for training and 62 for testing), while including all available single perturbation data in the training set.
  • Model Training and Fine-tuning: Train or fine-tune the chosen model (e.g., GEARS, scGPT) on the training set (single perturbations and a portion of the double perturbations).
  • Prediction and Benchmarking: Use the trained model to predict the transcriptomic outcomes for the held-out double perturbations in the test set. Compare the model's predictions against the ground truth data and the predictions from simple additive and 'no change' baselines.
  • Identification of Genetic Interactions: For each double perturbation in the test set, calculate the difference between the model's predicted expression and the additive expectation (the sum of the two single perturbation LFCs). Interactions are called if this difference exceeds a statistically determined threshold.
  • Characterization of Interaction Type: Classify the identified interactions as:
    • Buffering: The double perturbation effect is weaker than expected.
    • Synergistic: The double perturbation effect is stronger than expected.
    • Opposite: The double perturbation effect is in the opposite direction to the expected effect.
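For a single gene, the interaction-calling rule above can be sketched as follows; the fixed `threshold` stands in for the statistically determined cutoff used in practice:

```python
import numpy as np

def classify_interaction(lfc_double, lfc_a, lfc_b, threshold=0.5):
    """Compare the observed double-perturbation LFC with the additive
    expectation (sum of single-perturbation LFCs) and label the interaction."""
    expected = lfc_a + lfc_b
    diff = lfc_double - expected
    if abs(diff) <= threshold:
        return "additive"                   # no interaction called
    if np.sign(lfc_double) != np.sign(expected):
        return "opposite"                   # effect flips direction
    if abs(lfc_double) < abs(expected):
        return "buffering"                  # weaker than expected
    return "synergistic"                    # stronger than expected

print(classify_interaction( 3.0, 1.0, 1.0))   # synergistic
print(classify_interaction( 0.8, 1.0, 1.0))   # buffering
print(classify_interaction(-1.0, 1.0, 1.0))   # opposite
```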

The Scientist's Toolkit: Research Reagent Solutions

Successful implementation of ISP workflows relies on a combination of computational tools, biological data, and experimental reagents. The following table catalogues essential resources for the field.

Table 3: Essential Research Reagents and Resources for In Silico Perturbation

Category | Item / Resource | Specifications / Example | Function in ISP Workflow
Computational Tools & Models | Geneformer [9] [31] | A transformer model pre-trained on millions of single-cell transcriptomes. | Fine-tuned for predicting direction of cell state change upon perturbation.
 | scGPT [11] [31] | A generative pre-trained transformer model for single-cell biology. | Predicts high-dimensional transcriptome changes after genetic perturbations.
 | scTenifoldKnk [32] | A machine learning workflow for virtual KO using tensor decomposition and manifold alignment. | Performs virtual KO analysis using only WT scRNA-seq data to infer gene function.
 | Large Perturbation Model (LPM) [31] | A decoder-only model that disentangles Perturbation, Readout, and Context. | Integrates diverse perturbation data types (genetic, chemical) for outcome prediction.
Data Resources | CZ CELLxGENE [1] | A platform providing unified access to over 100 million annotated single cells. | Source of diverse, high-quality scRNA-seq data for model pre-training and fine-tuning.
 | Perturb-seq Datasets [11] [9] | e.g., Norman et al., Replogle et al. | Provides ground-truth scRNA-seq data from genetic screens for model training and benchmarking.
Experimental Reagents (for Validation) | CRISPR Activation/Interference (CRISPRa/i) | e.g., dCas9-VPR, dCas9-KRAB systems. | For experimental validation of ISP predictions via targeted gene overexpression or knockdown.
 | Primary Human T-cells [9] | Isolated from healthy donors. | A biologically relevant system for validating ISP predictions related to immune activation.
 | Engineered Human HSCs [9] | e.g., RUNX1-knockout models of RUNX1-FPD. | A disease model for validating ISP-predicted therapeutic targets in a rare genetic disorder.

The In Silico Perturbation workflow, powered by single-cell foundation models, provides a powerful and scalable framework for simulating genetic knockouts and overexpression. While current models show promise, benchmarking reveals that their performance against simple baselines requires careful evaluation [11]. The adoption of a closed-loop framework, which incorporates experimental data into model fine-tuning, significantly enhances prediction accuracy and represents a crucial step toward realizing the potential of "virtual cell" models for biomedical discovery [9]. As models evolve and integrate more diverse data types [31], ISP is poised to become an indispensable tool for functional genomics and therapeutic target identification.

The ability to accurately predict how a cell will respond to a genetic or chemical perturbation represents a significant unsolved challenge in biology with profound implications for understanding disease mechanisms and accelerating therapeutic development. Single-cell foundation models (scFMs) have emerged as powerful deep learning tools pre-trained on vast amounts of single-cell transcriptomics data, enabling in silico perturbation (ISP) predictions that simulate cellular responses without extensive experimental validation [9] [1]. These models represent an important step toward creating "virtual cells" that can simulate cellular responses to diverse perturbations, holding particular value for rare diseases where patient samples are scarce and experimental screening is challenging [9].

However, current "open-loop" scFMs face a critical limitation: while they generate predictions that can be experimentally tested, they cannot learn from these experiments to create better predictions [9]. This open-loop approach leaves a significant gap between computational prediction and experimental validation. Closing this loop represents a crucial step toward realizing the full potential of virtual cell models for biomedical discovery. This protocol details the methodology for implementing a closed-loop framework that extends scFMs by incorporating experimental perturbation data during model fine-tuning, substantially improving prediction accuracy and biological relevance [9].

Core Principles of Single-Cell Foundation Models

Architectural Foundations

Single-cell foundation models typically employ transformer-based architectures that learn from massive single-cell datasets through self-supervised pretraining [1]. In these models, individual cells are treated analogously to sentences, and genes or genomic features along with their expression values are treated as words or tokens [1]. The model learns fundamental principles of cellular organization that can be generalized to new datasets or downstream tasks through attention mechanisms that weight relationships between gene tokens [1].

Two predominant architectural approaches have emerged:

  • Encoder-based models (e.g., BERT-like): Utilize bidirectional attention mechanisms that learn from all genes in a cell simultaneously [1].
  • Decoder-based models (e.g., GPT-like): Employ unidirectional masked self-attention that iteratively predicts masked genes conditioned on known genes [1].

For perturbation prediction, these models are typically fine-tuned on specific cellular states and then used to simulate the effects of genetic perturbations such as gene knockouts or overexpression [9].

Current Limitations and Benchmarking Results

Despite their theoretical promise, critical benchmarking studies reveal significant limitations in current scFMs for perturbation prediction. A comprehensive assessment published in Nature Methods demonstrated that five foundation models and two other deep learning models failed to outperform deliberately simple baselines for predicting transcriptome changes after single or double perturbations [11]. The simple "additive" model that predicts the sum of individual logarithmic fold changes consistently outperformed more complex deep learning approaches [11].

Similarly, the PertEval-scFM benchmarking framework found that scFM embeddings offer limited improvement over simple baseline models in zero-shot settings, particularly under distribution shift [16]. These findings highlight the ongoing challenges in perturbation effect prediction and underscore the need for frameworks that can enhance model performance through iterative improvement.

Closed-Loop Framework Protocol

Conceptual Workflow

The closed-loop framework introduces an iterative feedback mechanism wherein experimental perturbation data is incorporated into model fine-tuning, creating a cycle of continuous improvement between in silico predictions and experimental validation [9]. This approach fundamentally transforms scFMs from static prediction tools into adaptive learning systems that become increasingly accurate with each experimental cycle.

Table 1: Key Performance Improvements with Closed-Loop Framework in T-cell Activation Model

| Metric | Open-Loop ISP | Closed-Loop ISP | Improvement |
| --- | --- | --- | --- |
| Positive Predictive Value (PPV) | 3% | 9% | 3-fold increase |
| Negative Predictive Value (NPV) | 98% | 99% | 1% increase |
| Sensitivity | 48% | 76% | 58% increase |
| Specificity | 60% | 81% | 35% increase |
| AUROC | 0.63 | 0.86 | 36% increase |
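The metrics in Table 1 follow from a standard confusion matrix. A minimal helper is sketched below; the counts in the example are illustrative and are not the study's underlying data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute PPV, NPV, sensitivity, and specificity from
    confusion-matrix counts (true/false positives and negatives)."""
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# illustrative counts only, chosen to reproduce the closed-loop
# sensitivity/specificity figures, not taken from the study
m = classification_metrics(tp=76, fp=19, tn=81, fn=24)
```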

Experimental Protocol for T-cell Activation Model

Initial Model Fine-tuning
  • Data Collection: Compile single-cell RNA sequencing (scRNA-seq) data from multiple studies where T cells were stimulated via CD3-CD28 beads or phorbol myristate acetate/ionomycin (PMA/ionomycin) [9].
  • Model Selection: Utilize Geneformer-30M-12L (or similar scFM) as the base model [9].
  • Fine-tuning Procedure:
    • Format input data using the model's standard tokenization approach (typically ranking genes by expression levels within each cell) [1].
    • Fine-tune the model to classify T-cell activation status using the compiled datasets.
    • Validate model performance on a hold-out test set of cells, targeting accuracy >99% and macroF1 >0.99 [9].
Open-Loop ISP Screening and Validation
  • In Silico Perturbation:

    • Perform ISP across 13,161 genes, simulating both gene overexpression (CRISPRa) and knockout (CRISPRi) [9].
    • Generate predictions for how each perturbation shifts T cells toward activated or resting states.
  • Experimental Validation:

    • Validate ISP predictions against orthogonal flow cytometry data from CRISPRi/CRISPRa screens measuring IL-2 and IFN-γ production after CD3-CD28 stimulation [9].
    • Calculate baseline performance metrics (PPV, NPV, sensitivity, specificity) for open-loop predictions [9].
Closed-Loop Implementation
  • Perturbation Data Incorporation:

    • Fine-tune the model with scRNA-seq data from CRISPR activation and interference screens in primary human T cells (Perturb-seq) alongside existing scRNA-seq data from resting and activated T cells [9].
    • Note: The Perturb-seq data should be labeled only with activation status, not with which specific gene was perturbed [9].
  • Iterative Refinement:

    • Perform ISP using the newly fine-tuned model on all genes except those perturbed in the experimental screens.
    • Compare performance metrics against open-loop baseline.
    • Determine optimal number of perturbation examples needed for substantial improvement through incremental additions of random perturbation subsets [9].

Table 2: Minimum Perturbation Examples Required for Performance Improvement

| Number of Examples | Sensitivity | Specificity | Performance Level |
| --- | --- | --- | --- |
| 10 examples | 61% (95% CI: 58-64%) | 66% (95% CI: 62-70%) | Substantial improvement |
| 20 examples | 76% (95% CI: 72-78%) | 79% (95% CI: 75-83%) | Performance saturation |
| >20 examples | No significant improvement | No significant improvement | Diminishing returns |

Application to RUNX1-Familial Platelet Disorder

Disease Context

RUNX1-familial platelet disorder (RUNX1-FPD) is a rare inherited hematologic disease affecting approximately 20,000 people in the US, characterized by thrombocytopenia, impaired platelet function, and increased risk of early-onset myeloid neoplasms [9]. Currently, no interventions exist to prevent progression to myeloid malignancies [9].

Experimental Protocol
  • Model System Development:

    • Leverage human hematopoietic stem cells (HSCs) engineered with RUNX1 loss-of-function mutations modeling RUNX1-FPD [9].
    • Generate scRNA-seq data from RUNX1-engineered HSCs and control HSCs targeting AAVS1 [9].
    • Validate the engineered system against scRNA-seq data from RUNX1-FPD patient HSCs, confirming concordance in downstream RUNX1 target expression [9].
  • Model Fine-tuning:

    • Fine-tune Geneformer-30M-12L to classify HSCs between RUNX1-engineered and control HSCs [9].
    • Verify successful distinction of these two cell states through evaluation metrics.
  • Therapeutic Target Identification:

    • Perform open-loop ISP to identify genes that, when deleted, shift RUNX1-knockout HSCs toward a control-like state [9].
    • Compare differential expression (DE) and ISP results to identify overlapping gene targets.
    • Select candidate genes with available specific small molecule inhibitors for experimental validation [9].
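Intersecting the DE and ISP hit lists and filtering for druggability (the last two steps above) reduces to simple set operations. The gene names and the inhibitor mapping below are placeholders, not the study's actual candidate list.

```python
def prioritize_targets(de_genes, isp_genes, inhibitor_map):
    """Keep genes supported by both DE analysis and ISP, then flag those
    with a known specific small-molecule inhibitor for follow-up."""
    overlap = sorted(set(de_genes) & set(isp_genes))
    druggable = [g for g in overlap if g in inhibitor_map]
    return overlap, druggable

overlap, druggable = prioritize_targets(
    de_genes=["PRKCB", "UBB", "MYC", "GATA1"],   # illustrative hits
    isp_genes=["PRKCB", "UBB", "TP53"],
    inhibitor_map={"PRKCB": "PKC-beta inhibitor"},  # placeholder mapping
)
```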

[Workflow diagram: RUNX1-FPD therapeutic target discovery — fine-tune the scFM on the RUNX1-FPD HSC model, perform in silico screening, intersect ISP hits with differential expression analysis, prioritize druggable candidate targets, and validate experimentally.]

Research Reagent Solutions

Table 3: Essential Research Reagents for Closed-Loop Framework Implementation

| Reagent/Category | Specific Examples | Function/Application |
| --- | --- | --- |
| Single-cell Foundation Models | Geneformer-30M-12L, scGPT, scFoundation | Base models for fine-tuning and perturbation prediction [9] [11] |
| Genetic Perturbation Systems | CRISPRi, CRISPRa, Perturb-seq | Experimental generation of perturbation data for model training [9] |
| Validation Assays | Flow cytometry (IL-2, IFN-γ production), scRNA-seq | Orthogonal validation of in silico predictions [9] |
| Cell Model Systems | Primary human T cells, RUNX1-engineered HSCs | Biological contexts for model development and testing [9] |
| Computational Frameworks | PertEval-scFM | Benchmarking and evaluation of perturbation predictions [16] |

Key Findings and Therapeutic Applications

The implementation of the closed-loop framework in T-cell activation models demonstrated a three-fold increase in positive predictive value (from 3% to 9%), with concurrent improvements in negative predictive value (99%), sensitivity (76%), and specificity (81%) compared to open-loop approaches [9]. The area under the receiver operating characteristic curve (AUROC) improved significantly, from 0.63 (95% CI: 0.58-0.68) to 0.86 (95% CI: 0.83-0.89) [9].

Application to RUNX1-FPD identified novel therapeutic targets and pathways, including:

  • mTOR and CD74-MIF signaling axis as direct therapeutic targets
  • Protein kinase C and phosphoinositide 3-kinase as novel pathways
  • Fourteen genes predicted by both DE and ISP to significantly shift RUNX1-knockout cells toward control cells [9]

From these targets, eight genes with available specific small molecule inhibitors were selected for experimental validation, including PRKCB and UBB [9]. This demonstrates the framework's potential for accelerating rare disease drug discovery by prioritizing the most promising therapeutic targets for experimental validation.

The closed-loop framework for integrating experimental perturbation data into scFM fine-tuning represents a significant advancement in in silico perturbation modeling. By creating an iterative feedback loop between computational predictions and experimental validation, this approach substantially improves prediction accuracy and biological relevance. The methodology detailed in this protocol provides researchers with a standardized approach for implementing this framework across diverse biological contexts and disease models.

Future development should focus on expanding the framework to incorporate diverse data modalities, improving model architectures specifically for perturbation prediction, and addressing current limitations identified in benchmarking studies [11] [16]. As these frameworks mature, they hold tremendous promise for accelerating therapeutic discovery, particularly for rare diseases where conventional screening approaches are impractical.

RUNX1-Familial Platelet Disorder (RUNX1-FPD) is a rare autosomal dominant inherited condition characterized by thrombocytopenia, impaired platelet function, and a pronounced predisposition to develop myeloid malignancies, most commonly myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML) [34]. The disease is caused by germline loss-of-function mutations in the RUNX1 gene, a crucial transcription factor in hematopoiesis. The estimated risk of progressing to a myeloid malignancy is approximately 40%, with a median age of onset of 33 years, though cases have been reported from age 2 to 72 [34] [35]. Affecting over 18,000 people in the United States, RUNX1-FPD presents significant clinical challenges due to the scarcity of patient samples and the lack of interventions to prevent leukemic transformation [9].

The clinical presentation is marked by significant phenotypic heterogeneity, even among family members carrying the identical RUNX1 mutation. A documented case study illustrates this variability: a 5-year-old boy presented with isolated thrombocytopenia, his mother developed MDS at 27 years, while his maternal grandfather remained asymptomatic with a normal platelet count at 60 years of age [34]. This heterogeneity complicates clinical prognosis and underscores the need for personalized therapeutic strategies. The molecular pathogenesis often involves subsequent somatic mutations in genes such as BCOR, PTPN11, KRAS, and TET2, which likely contribute to disease progression [34].

Technical Foundation: From Open-Loop to Closed-Loop ISP

In Silico Perturbation (ISP) with single-cell foundation models (scFMs) represents a paradigm shift in biomedical research. scFMs are large-scale deep learning models, typically based on Transformer architectures, pre-trained on vast single-cell RNA sequencing (scRNA-seq) datasets. They learn the fundamental "language" of cells, where individual cells are treated as sentences and genes or genomic features as words [1] [36]. A key application is ISP, which simulates cellular responses to genetic perturbations (e.g., gene knockouts or overexpression) computationally, acting as a "virtual cell" platform [9]. This is particularly valuable for rare diseases like RUNX1-FPD, where experimental screening with patient samples is severely limited.

The standard open-loop ISP approach involves fine-tuning an scFM, such as Geneformer, on a target cellular state (e.g., RUNX1-knockout Hematopoietic Stem Cells (HSCs) vs. controls) and then predicting genes that, when perturbed, shift the diseased state toward a healthy one [9]. However, this open-loop paradigm has a critical limitation: its predictions are made in a vacuum, without the ability to learn from subsequent experimental validation.

The closed-loop ISP framework introduces a crucial iterative feedback mechanism. After the initial ISP predictions are generated, they are experimentally tested. The scRNA-seq data from these experimental perturbations are then incorporated back into the model during a subsequent fine-tuning round. This "closes the loop," allowing the model to learn from empirical data and refine its predictive capabilities [9]. The entire workflow, from model setup to therapeutic discovery, is outlined below.

[Workflow diagram: pre-trained scFM → fine-tune on RUNX1-FPD HSCs → open-loop ISP predictions → experimental validation (Perturb-seq) → incorporate perturbation data → closed-loop ISP predictions → therapeutic target prioritization.]

Application Notes: Implementing Closed-Loop ISP for RUNX1-FPD

Benchmarking and Advantages of the Closed-Loop Approach

The implementation of the closed-loop framework demonstrates a substantial quantitative improvement over traditional open-loop ISP. In the context of T-cell activation, a model system for benchmarking, the incorporation of even a small number of experimental perturbation examples during fine-tuning dramatically enhanced predictive performance [9].

Table 1: Performance Comparison of Open-Loop vs. Closed-Loop ISP for T-cell Activation (based on [9])

| Metric | Open-Loop ISP | Closed-Loop ISP | Relative Improvement |
| --- | --- | --- | --- |
| Positive Predictive Value (PPV) | 3% | 9% | 3-fold increase |
| Negative Predictive Value (NPV) | 98% | 99% | Marginal improvement |
| Sensitivity | 48% | 76% | 1.6-fold increase |
| Specificity | 60% | 81% | 1.35-fold increase |
| AUROC | 0.63 | 0.86 | 36% increase |

A critical finding was the data efficiency of the closed-loop approach. Performance metrics improved dramatically with just 10 perturbation examples (Sensitivity: 61%, Specificity: 66%) and began to saturate after incorporating approximately 20 examples (Sensitivity: 76%, Specificity: 79%). This indicates that even a modest number of experimental validations can substantially enhance model accuracy, making the approach feasible for research on rare diseases where data is scarce [9].

Target Discovery for RUNX1-FPD

Applying the closed-loop ISP framework to RUNX1-FPD, researchers began by fine-tuning the Geneformer model on human HSCs engineered with RUNX1 loss-of-function mutations, which showed high concordance with patient-derived HSCs [9]. The model was tasked with identifying genes whose deletion would shift the RUNX1-knockout HSCs toward a control-like state.

The initial open-loop ISP, combined with differential expression (DE) analysis, identified 14 high-confidence candidate genes predicted by both methods. From this list, eight genes with available specific small molecule inhibitors were selected for further investigation [9]. The closed-loop process helped prioritize the most promising therapeutic targets and pathways.

Table 2: Therapeutic Targets and Pathways Identified via Closed-Loop ISP for RUNX1-FPD (based on [9])

| Category | Target/Pathway | Potential Therapeutic Agent | Proposed Mechanism |
| --- | --- | --- | --- |
| Primary Targets | mTOR signaling | mTOR inhibitors (e.g., rapamycin) | Corrects dysregulated protein synthesis and cell growth in RUNX1-deficient HSCs. |
| Primary Targets | CD74-MIF signaling axis | MIF inhibitors | Modulates inflammatory signaling implicated in the disease phenotype. |
| Novel Pathways | Protein Kinase C (PKC) | PKC inhibitors | Targets dysregulated intracellular signal transduction. |
| Novel Pathways | Phosphoinositide 3-Kinase (PI3K) | PI3K inhibitors | Acts on a key signaling pathway downstream of multiple receptors. |

The following diagram illustrates the signaling pathways identified as potential therapeutic targets for RUNX1-FPD, highlighting the points of intervention for small molecule inhibitors.

[Pathway diagram: RUNX1 loss produces a dysregulated HSC state acting through mTOR signaling, the CD74-MIF axis, and the PKC and PI3K pathways; the corresponding small-molecule inhibitors (mTOR, MIF, PKC, and PI3K inhibitors) intervene at each pathway to shift cells toward a control-like HSC state.]

Experimental Protocols

Protocol 1: Fine-tuning an scFM for RUNX1-FPD ISP

This protocol describes the initial fine-tuning of a pre-trained single-cell foundation model to establish a baseline for in silico perturbation predictions in RUNX1-FPD.

Materials:

  • Pre-trained scFM (e.g., Geneformer-30M-12L)
  • scRNA-seq dataset of RUNX1-knockout HSCs
  • scRNA-seq dataset of isogenic control HSCs (e.g., targeting AAVS1 safe harbor locus)
  • High-performance computing (HPC) cluster with GPU acceleration

Procedure:

  • Data Preprocessing: Ensure both knockout and control HSC datasets are normalized and log-transformed. Map gene identifiers to match the pre-trained model's vocabulary.
  • Model Setup: Load the pre-trained Geneformer model. Configure the final classification layer to output a binary classification (RUNX1-knockout vs. control).
  • Fine-tuning:
    • Freeze the initial layers of the transformer model (e.g., first 6 layers) to retain general biological knowledge.
    • Train the unfrozen layers on the HSC classification task.
    • Use a balanced dataset to avoid bias toward either cell state.
    • Employ a standard cross-entropy loss function and the AdamW optimizer with a learning rate of 5e-5.
    • Train until validation accuracy plateaus (typically 10-20 epochs).
  • Validation: Evaluate the fine-tuned model on a held-out test set of HSCs. A successful model should achieve >99% accuracy in distinguishing RUNX1-knockout from control cells [9].
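The layer-freezing step of the fine-tuning procedure can be sketched as a small utility that decides which parameters stay trainable. The BERT-style `encoder.layer.N.` naming below is an assumption for illustration, not Geneformer's actual attribute names; adapt the prefix to the model you load.

```python
def freeze_mask(param_names, n_frozen_layers=6, layer_prefix="encoder.layer."):
    """Map each parameter name to a requires_grad flag, freezing the first
    `n_frozen_layers` transformer blocks to retain general biological
    knowledge. Non-block parameters (e.g., the classifier head) remain
    trainable in this sketch."""
    mask = {}
    for name in param_names:
        if name.startswith(layer_prefix):
            layer_idx = int(name[len(layer_prefix):].split(".")[0])
            mask[name] = layer_idx >= n_frozen_layers
        else:
            mask[name] = True
    return mask

mask = freeze_mask([
    "encoder.layer.0.attention.weight",   # frozen (layer < 6)
    "encoder.layer.7.attention.weight",   # trainable
    "classifier.weight",                  # trainable head
])
```

In a PyTorch training loop, one would then set `param.requires_grad = mask[name]` for each `(name, param)` pair before constructing the AdamW optimizer.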

Protocol 2: Executing and Validating Closed-Loop ISP

This protocol details the iterative process of generating ISP predictions, experimentally testing them, and refining the model.

Materials:

  • Fine-tuned scFM from Protocol 1.
  • Primary human T-cells or engineered HSCs for perturbation experiments.
  • CRISPRa/CRISPRi libraries for candidate genes.
  • Facilities for single-cell RNA sequencing (e.g., 10x Genomics platform).

Procedure:

  • Initial (Open-loop) ISP:
    • Use the fine-tuned model from Protocol 1 to perform in silico knockout simulations for all protein-coding genes (~13,000 genes).
    • Rank genes based on the predicted magnitude of shift from RUNX1-knockout state toward the control state.
    • Cross-reference ISP predictions with differential expression (DE) analysis between RUNX1-knockout and control HSCs to generate a high-confidence candidate list.
  • Experimental Perturbation:

    • Select the top 20-50 candidate genes from the high-confidence list.
    • Design and clone a CRISPRa/i library targeting these genes.
    • Transduce primary human T-cells or engineered HSCs with the perturbation library.
    • Perform scRNA-seq (Perturb-seq) on the perturbed cell population. Label the cells only with their activation/state status, not the perturbed gene [9].
  • Closed-loop Fine-tuning:

    • Combine the original HSC training data with the new scRNA-seq data from the Perturb-seq experiment.
    • Re-fine-tune the pre-trained scFM (or the model from Protocol 1) on this combined dataset, using the same binary classification objective.
    • This step teaches the model the empirical consequences of real-world perturbations.
  • Refined (Closed-loop) ISP:

    • Run the ISP again using the newly fine-tuned model.
    • The refined predictions will show improved accuracy and a higher positive predictive value, more effectively prioritizing true therapeutic targets like mTOR and the CD74-MIF axis [9].
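The in silico knockout at the core of both ISP passes can be sketched as deleting a gene's token from the ranked sequence and scoring how the classifier's disease probability shifts. This is a conceptual sketch only: `classify` is a stand-in for the fine-tuned model, and Geneformer's actual ISP procedure compares cell embeddings rather than a scalar probability.

```python
def isp_knockout_shift(tokens, gene, classify):
    """Score an in silico knockout: remove `gene` from the cell's token
    sequence and report the change in predicted disease probability.
    Negative values mean the knockout shifts the cell toward the
    control-like state, flagging a candidate therapeutic target."""
    perturbed = [t for t in tokens if t != gene]
    return classify(perturbed) - classify(tokens)

# stand-in classifier: disease probability is high while a marker token
# is present (purely illustrative logic)
fake_classify = lambda toks: 0.9 if "PRKCB" in toks else 0.3
shift = isp_knockout_shift(["PRKCB", "UBB", "ACTB"], "PRKCB", fake_classify)
```

Ranking all genes by this shift yields the candidate list that the closed-loop fine-tuning subsequently refines.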

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Research Reagents and Computational Tools for Closed-Loop ISP

| Category / Item | Specific Example(s) | Function and Application |
| --- | --- | --- |
| Single-Cell Foundation Models | Geneformer-30M-12L, scGPT, scFoundation [37] | Pre-trained models providing the base for fine-tuning and ISP tasks on single-cell data. |
| Computational Framework | Closed-loop ISP custom code (PyTorch) [9] | Software environment for model fine-tuning, running in silico perturbations, and integrating new data. |
| RUNX1-FPD Cell Model | Human HSCs with RUNX1 loss-of-function (CRISPR/Cas9) [9] | Biologically relevant in vitro system to model the disease and validate predictions. |
| Perturbation Screening Tool | CRISPR activation/interference (CRISPRa/i) with Perturb-seq [9] | Technology for experimentally perturbing candidate genes and measuring genome-wide effects at single-cell resolution. |
| Key Therapeutic Inhibitors | mTOR inhibitors, MIF inhibitors, PKC inhibitors, PI3K inhibitors [9] | Small molecules used for functional validation of predicted therapeutic targets in vitro and in vivo. |

The application of the closed-loop ISP framework to RUNX1-Familial Platelet Disorder represents a significant advancement in computational biology and rare disease research. By iteratively refining a single-cell foundation model with empirical data from targeted perturbations, this approach transforms the "virtual cell" from a static predictor into a dynamic, learning system. The method successfully identified several high-priority therapeutic targets, including the mTOR and CD74-MIF signaling axes, demonstrating the potential of AI-driven in silico discovery to accelerate the development of much-needed interventions for patients with this high-risk predisposition syndrome. This closed-loop paradigm is broadly applicable to a wide range of other genetic diseases, heralding a new era where computational models and experimental biology are tightly integrated to decipher and treat complex medical conditions.

Application Note: Landscape and Benchmarking of Computational Models

The accurate in silico prediction of combinatorial genetic perturbation effects represents a cornerstone for advancing functional genomics and therapeutic discovery. Within the broader thesis of in silico perturbation modeling using single-cell Foundation Models (scFMs), this application note details the current computational landscape, performance benchmarks, and standardized protocols for modeling these complex biological interactions. The ability to simulate genetic interactions and synergistic drug effects enables researchers to prioritize experimental work, elucidate functional genetic networks, and identify novel therapeutic combinations with reduced experimental burden.

Current Model Landscape and Performance

Recent benchmarking studies reveal a critical insight: despite their architectural complexity, many deep-learning foundation models do not consistently outperform deliberately simple linear baselines in predicting transcriptome-wide perturbation outcomes [11]. The field is rapidly evolving, with new architectures like the Large Perturbation Model (LPM) showing promise by explicitly disentangling Perturbation, Readout, and Context (PRC) dimensions, thereby enabling the integration of heterogeneous experimental data across diverse perturbations (e.g., CRISPR, chemical), readouts (e.g., transcriptomics, viability), and biological contexts [4].

Table 1 summarizes quantitative performance comparisons across key methodologies. Performance is typically measured as the Pearson correlation between predicted and observed gene expression values for held-out perturbations.

Table 1: Benchmarking of Perturbation Prediction Models

| Model | Model Type | Key Innovation | Reported Performance (Pearson r) | Data Modalities Supported |
| --- | --- | --- | --- | --- |
| Large Perturbation Model (LPM) [4] | PRC-disentangled deep learning | Disentangles Perturbation, Readout, Context | State-of-the-art (exact values not provided) | Genetic & chemical; transcriptomics & viability |
| GPerturb [38] | Gaussian process regression | Sparse, interpretable effects with uncertainty estimates | 0.981 (Replogle), 0.979 (Norman) | Single-cell CRISPR screens (count & continuous data) |
| CPA [11] | Autoencoder | Predicts combinatorial & dose-dependent effects | Outperformed by linear baselines in double perturbation [11] | Continuous expression, dosages |
| GEARS [11] | Graph-enhanced deep learning | Incorporates Gene Ontology knowledge graphs | Outperformed by linear baselines [11] | Discrete genetic perturbations |
| scGPT / Geneformer [11] | Single-cell foundation models | Transformer-based, pretrained on scRNA-seq data | Did not outperform simple additive baseline [11] | Transcriptomics |
| Additive Baseline [11] | Simple linear model | Sum of individual logarithmic fold changes (LFCs) | Benchmark for double perturbations [11] | Gene expression |

A significant challenge in the field is the prediction of genetic interactions, where the effect of a double perturbation deviates from the expected combination of single effects. In a benchmark using data from Norman et al., which included 124 double gene perturbations in K562 cells, models like GEARS, scGPT, and scFoundation were unable to outperform a simplistic "no change" or "additive" baseline in identifying these interactions [11]. Furthermore, most models demonstrated a strong bias towards predicting "buffering" interactions and were notably poor at identifying the rarer "synergistic" interactions correctly [11].
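A genetic interaction of this kind is quantified per gene as the deviation of the observed double-perturbation LFC from the additive expectation; positive deviations suggest synergy, negative ones buffering. A minimal sketch (thresholding and FDR control are omitted):

```python
import numpy as np

def interaction_scores(lfc_a, lfc_b, lfc_ab):
    """Per-gene deviation of the observed double-perturbation LFC from
    the additive expectation lfc_a + lfc_b. Positive -> synergistic,
    negative -> buffering; significance testing is omitted here."""
    return np.asarray(lfc_ab) - (np.asarray(lfc_a) + np.asarray(lfc_b))

scores = interaction_scores([0.5, 0.2], [0.3, -0.1], [1.2, 0.0])
```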

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Research Reagents and Computational Tools

| Reagent / Tool | Function / Description | Application in Perturbation Modeling |
| --- | --- | --- |
| CROP-seq / Perturb-seq [38] | Single-cell RNA-seq technology coupling CRISPR perturbations with transcriptomic readouts. | Generates high-throughput training and validation data for models. |
| LINCS Datasets [4] | Library of Integrated Network-Based Cellular Signatures; contains genetic and pharmacological perturbation data. | Used for training cross-modal models like LPM. |
| PhenotypeGenetics Software [39] | Open-source, cross-platform software for deriving genetic-interaction networks from quantitative phenotype data. | Computationally assigns interaction modes from phenotype inequalities. |
| Gene Ontology (GO) Annotations [11] | Structured, controlled vocabularies of gene and gene product attributes. | Used by models like GEARS to inform gene relationships for predicting unseen perturbations. |
| DrugComboRanker / AuDNNsynergy [40] | AI-based algorithms for predicting synergistic and antagonistic drug combinations. | Applied in multi-omics drug discovery for anti-cancer and antimicrobial therapy optimization. |

Experimental Protocols

Protocol 1: Benchmarking Model Performance on Double Genetic Perturbations

Purpose: To objectively evaluate the performance of a new perturbation prediction model against established baselines using a standardized dataset of single and double genetic perturbations.

Background: This protocol is adapted from benchmarks performed in [11], which highlighted the critical importance of comparing against simple baselines.

Materials:

  • Hardware: Workstation with GPU (e.g., NVIDIA A100) for deep learning model training.
  • Software: Python environment with relevant model implementations (e.g., scGPT, GEARS, CPA).
  • Dataset: Norman et al. dataset [11], comprising single-cell RNA-seq data for 100 single-gene and 124 double-gene CRISPRa perturbations in K562 cells.

Procedure:

  • Data Preprocessing:
    • Download and preprocess the Norman et al. dataset to match the input requirements of the models being tested.
    • Normalize and log-transform gene expression counts.
    • Focus analysis on the 1,000 most highly expressed genes or a similarly defined subset for computational efficiency and noise reduction.
  • Experimental Setup:

    • Partition the data into training and test sets. The training set must include all 100 single perturbations and a random subset of 62 double perturbations. The remaining 62 double perturbations are held out for testing.
    • Repeat this partitioning five times with different random seeds to ensure robustness of the results.
  • Model Training and Fine-tuning:

    • Train or fine-tune the candidate deep learning models (e.g., scGPT, GEARS) on the training set according to their published protocols.
    • Simultaneously, generate predictions using the two simple baselines:
      • No Change Baseline: Predicts the control condition expression for all perturbations.
      • Additive Baseline: For a double perturbation A+B, predicts the sum of the LFCs of single perturbations A and B.
  • Performance Evaluation:

    • For each model and baseline, calculate the prediction error as the L2 distance between the predicted and observed expression values for all genes in the test set.
    • Compare the prediction errors of the complex models against the simple baselines. A superior model must consistently achieve a lower error than the baselines across all five splits.
    • To evaluate genetic interaction prediction, identify significant genetic interactions in the ground truth data (e.g., deviations from additivity at a 5% FDR). Calculate the True-Positive Rate (TPR) and False Discovery Proportion (FDP) for the model's interaction predictions across various thresholds.
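The model-versus-baseline comparison in step 4 reduces to an L2 distance between predicted and observed expression per held-out perturbation. A minimal sketch with illustrative values:

```python
import numpy as np

def l2_error(pred, obs):
    """Prediction error as the L2 distance across all genes."""
    return float(np.linalg.norm(np.asarray(pred) - np.asarray(obs)))

def beats_baseline(model_preds, baseline_preds, observed):
    """True if the model's L2 error is below the baseline's for every
    held-out perturbation (dict keys are perturbation identifiers)."""
    return all(
        l2_error(model_preds[p], observed[p]) < l2_error(baseline_preds[p], observed[p])
        for p in observed
    )

obs = {"A+B": [1.0, 0.0]}  # toy two-gene expression profile
ok = beats_baseline({"A+B": [0.9, 0.1]}, {"A+B": [0.5, 0.5]}, obs)
```

Per the benchmark's criterion, a model should satisfy this check consistently across all five random splits before it is considered superior.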

Troubleshooting:

  • If a complex model underperforms the additive baseline, this indicates the model may not be effectively learning the underlying interaction biology from the data [11].
  • Ensure that the data preprocessing and normalization steps are consistent across all models to allow for a fair comparison.

Protocol 2: Training and Applying the Large Perturbation Model (LPM)

Purpose: To train an LPM for multi-task biological discovery, including predicting effects of unseen perturbations and mapping shared mechanisms between chemical and genetic perturbations.

Background: The LPM architecture integrates heterogeneous data by treating Perturbation (P), Readout (R), and Context (C) as separate, disentangled conditioning variables [4].

Materials:

  • Datasets: Pooled perturbation data from sources like LINCS [4], encompassing multiple cell lines, perturbation types (CRISPR, chemical), and readouts (transcriptomics, cell viability).

Procedure:

  • Data Integration and Tokenization:
    • Compile a diverse set of perturbation experiments into a unified schema.
    • Represent each experiment as a (P, R, C) tuple, where P is the perturbation identity (e.g., gene target, compound name), R is the readout feature (e.g., gene symbol, viability metric), and C is the biological context (e.g., cell line, tissue type).
  • Model Training:

    • Train the decoder-only LPM architecture to predict the quantitative outcome of a given (P, R, C) combination.
    • The model learns joint representations in latent spaces for perturbations, readouts, and contexts.
  • Model Application for Discovery:

    • Predicting Unseen Perturbations: Query the model with a novel (P, R, C) combination not present in the training data to simulate an in silico experiment.
    • Mapping a Compound-CRISPR Shared Space: Extract the learned perturbation embeddings and visualize them using dimensionality reduction (e.g., t-SNE). Analyze the clustering of pharmacological inhibitors and CRISPR perturbations targeting the same gene to validate the model's biological plausibility [4].
    • Identifying Anomalies: Investigate compounds that cluster distantly from their putative genetic targets, as these may indicate off-target activities or novel mechanisms, a finding substantiated by LPM analysis of drugs like pravastatin [4].
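The (P, R, C) schema above can be sketched as three separate, disentangled embedding tables whose vectors are combined per query. The vocabularies, dimensions, and concatenation below are illustrative of the idea only, not LPM's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy vocabularies for each disentangled axis
perturbations = {"CRISPRi:RUNX1": 0, "drug:pravastatin": 1}
readouts      = {"gene:IL2": 0, "viability": 1}
contexts      = {"K562": 0, "primary_T": 1}

# one embedding table per axis (8-dim vectors, illustrative size)
E_p = rng.normal(size=(len(perturbations), 8))
E_r = rng.normal(size=(len(readouts), 8))
E_c = rng.normal(size=(len(contexts), 8))

def prc_representation(p, r, c):
    """Concatenate the three axis embeddings for a (P, R, C) query;
    a decoder head (omitted) would map this to the predicted outcome."""
    return np.concatenate([E_p[perturbations[p]],
                           E_r[readouts[r]],
                           E_c[contexts[c]]])

z = prc_representation("CRISPRi:RUNX1", "gene:IL2", "K562")
```

Because the perturbation table is shared across readouts and contexts, distances between its rows are what support analyses like the compound-CRISPR shared space described above.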

Protocol 3: Quantifying Genetic Interactions from Phenotype Data

Purpose: To systematically classify genetic interaction modes from quantitative phenotype data using the PhenotypeGenetics framework.

Background: This classical, computation-based method defines genetic interactions through inequalities between the phenotypes of wild-type, single-mutant, and double-mutant genotypes [39]. It provides a model-agnostic way to establish ground truth for interactions.

Materials:

  • Software: PhenotypeGenetics plugin for Cytoscape [39].
  • Data: Quantitative phenotype measurements (e.g., growth rate, expression of a marker, filamentous growth in yeast) for wild-type, single-gene perturbed, and double-gene perturbed genotypes, with associated error estimates.

Procedure:

  • Data Collection and Error Estimation:
    • For each genotype (WT, A, B, AB), collect quantitative phenotype measurements in multiple replicates.
    • Calculate the mean phenotype value and the associated error range for each genotype.
  • Classification of Interaction Modes:

    • For each pair of perturbations (A, B), establish a phenotypic order by comparing the mean phenotypes and their error-bounded intervals.
    • Assign one of nine exclusive genetic interaction modes based on the inequality relationships [39]:
      • Familiar Modes: Non-interactive, Epistatic, Synthetic, Conditional, Suppressive, Additive.
      • Less Familiar Modes: Asynthetic (A, B, AB have same deviant phenotype), Single-nonmonotonic, Double-nonmonotonic.
    • Asymmetric interactions (e.g., Epistasis) are represented as directed edges in a network.
  • Network Construction and Analysis:

    • Use PhenotypeGenetics to construct a genetic interaction network where nodes are perturbations and edges represent the classified interaction modes.
    • Analyze the network for local and global patterns. Identify "monochromatic" patterns where a specific perturbation interacts with multiple genes in the same biological process via the same interaction mode, thus validating the functional relevance of the findings [39].

Workflow and Pathway Visualizations

In Silico Perturbation Modeling Workflow

PRC-Disentangled Model Architecture

Genetic Interaction Classification Modes

[Diagram: Genetic Interaction Classification Modes. Nodes for the wild type (Φ_WT), single mutants A (Φ_A) and B (Φ_B), and the double mutant AB (Φ_AB) connect to the interaction modes they jointly define, including Non-Interactive, Additive, Synthetic (Φ_A = Φ_B = Φ_WT < Φ_AB), and Epistatic (Φ_A < Φ_WT < Φ_B = Φ_AB).]

The drug discovery landscape for rare diseases is fraught with challenges, including small patient populations, limited access to biological samples, and often poorly understood pathophysiology [41]. In silico technologies, particularly single-cell foundation models (scFMs), are emerging as powerful tools to overcome these barriers by enabling the prediction of cellular responses to genetic and chemical perturbations. These "virtual cell" models provide a scalable, human-relevant platform for identifying and prioritizing therapeutic targets, especially where experimental screening with scarce patient samples is infeasible [9]. This Application Note details protocols for leveraging in silico perturbation modeling to accelerate target identification and validation for rare diseases, providing a structured framework for researchers and drug development professionals.

In Silico Technologies in Rare Disease Research: Contexts of Use

Computational approaches are being deployed across the rare disease research and development continuum. The table below summarizes the key contexts of use (CoUs) for in silico technologies, highlighting their specific applications and the methodologies employed.

Table 1: Contexts of Use for In Silico Technologies in Rare Disease Research

| Context of Use (CoU) | Primary Applications | Representative Methodologies |
| --- | --- | --- |
| Diagnosis & Disease Characterization (CoU1) | Variant interpretation, phenotype mining, disease stratification [41] | AI-enhanced genomic pipelines (e.g., popEVE), NLP-EHR analysis, deep learning for pathogenicity prediction [41] [42] |
| Drug Discovery (CoU2) | Target identification/prioritization, virtual screening, drug repurposing [41] | Network pharmacology, AI-led target ID (e.g., PandaOmics), molecular docking, QSAR modeling [41] [43] |
| Preclinical Development (CoU3) | Disease mechanism modeling, biomarker nomination, efficacy prediction [41] | Single-cell Foundation Models (scFMs), Quantitative Systems Pharmacology (QSP), organoid-ML simulations [41] [9] |
| Clinical Trial Design (CoU4) | Virtual trials, synthetic control arms, pharmacokinetic/pharmacodynamic (PK/PD) modeling [41] | Pharmacometric models, PBPK, virtual patient cohort simulation [41] |

Single-Cell Foundation Models for Perturbation Modeling

Single-cell foundation models (scFMs), such as Geneformer, are deep learning models pre-trained on vast amounts of single-cell transcriptomics data [9]. They can be fine-tuned for specific tasks, including in silico perturbation (ISP), which predicts how a genetic perturbation (e.g., gene knockout or overexpression) would alter a cell's transcriptomic state [9]. A critical advancement is the "closed-loop" framework, where the model iteratively incorporates experimental perturbation data during fine-tuning to significantly improve prediction accuracy [9].

Experimental Protocol: Benchmarking and Implementing a Closed-Loop ISP Workflow

This protocol outlines the steps for fine-tuning a scFM and implementing a closed-loop ISP to identify therapeutic targets for a rare disease.

I. Model Fine-Tuning for Disease State Classification

  • Objective: Adapt a pre-trained scFM (e.g., Geneformer) to distinguish between diseased and healthy control cell states.
  • Procedure:
    • Data Acquisition: Generate or source single-cell RNA sequencing (scRNA-seq) data from patient-derived cells or a validated engineered in vitro model of the rare disease (e.g., RUNX1-knockout hematopoietic stem cells for RUNX1-Familial Platelet Disorder) [9]. Include matched control data.
    • Data Preprocessing: Ensure data is formatted to the input specifications of the chosen scFM. This typically includes quality control, normalization, and gene count matrix generation.
    • Fine-Tuning: Fine-tune the pre-trained scFM model using the disease and control scRNA-seq data. The goal is for the model to learn to classify a cell's state accurately. Performance should be validated on a held-out test set of cells [9].

II. Open-Loop In Silico Perturbation Screening

  • Objective: Perform an initial, unbiased screen for genes that, when perturbed, shift the diseased cell state toward a healthy state.
  • Procedure:
    • Perturbation Simulation: Using the fine-tuned model, perform ISP across a wide panel of genes (e.g., simulating knockout or overexpression for each gene in the model).
    • Output Analysis: The model will output a predicted transcriptomic profile for each perturbation. Identify genes for which the predicted profile shifts the diseased cells significantly toward the control state [9].
    • Triangulation: Cross-reference ISP predictions with results from differential expression (DE) analysis between disease and control cells. Genes identified by both methods represent high-confidence candidates [9].
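The open-loop scoring step can be illustrated with a toy stand-in: a disease-vs-control classifier trained on synthetic expression vectors, where each simulated knockout is scored by how far it shifts the mean predicted disease probability toward the control state. The actual Geneformer ISP operates on ranked token embeddings rather than raw vectors, so everything below, including the data and the scoring rule, is illustrative:

```python
# Toy stand-in for open-loop ISP hit scoring (assumes a disease-vs-control
# classifier; the real Geneformer workflow works in embedding space).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_genes = 50

# Synthetic disease cells differ from controls mainly in gene 0.
control = rng.normal(0.0, 1.0, size=(200, n_genes))
disease = rng.normal(0.0, 1.0, size=(200, n_genes))
disease[:, 0] += 3.0

X = np.vstack([control, disease])
y = np.array([0] * 200 + [1] * 200)          # 0 = control, 1 = disease
clf = LogisticRegression(max_iter=1000).fit(X, y)

def isp_score(cells, gene_idx):
    """Score a simulated knockout: how much does zeroing one gene
    shift the mean predicted probability of the 'disease' class?"""
    perturbed = cells.copy()
    perturbed[:, gene_idx] = 0.0
    before = clf.predict_proba(cells)[:, 1].mean()
    after = clf.predict_proba(perturbed)[:, 1].mean()
    return before - after                     # positive = shift toward control

scores = {g: isp_score(disease, g) for g in range(n_genes)}
top_gene = max(scores, key=scores.get)        # the causal gene should rank first
```

Candidates scored this way would then be triangulated against differential expression results, as described above.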

III. Closing the Loop: Model Enhancement with Experimental Data

  • Objective: Dramatically improve model accuracy by incorporating experimental perturbation data.
  • Procedure:
    • Experimental Validation: Select a subset of top candidate genes from the open-loop screen for experimental validation using a method like Perturb-seq (CRISPR-based perturbation coupled with scRNA-seq).
    • Iterative Fine-Tuning: Use the experimentally generated scRNA-seq data from the perturbations (labeled with the resulting cell state, e.g., "more disease-like" or "more control-like") to further fine-tune the scFM.
    • Final Prediction: Run the closed-loop (refined) ISP model to generate a refined list of high-priority therapeutic targets. Research indicates this can increase the positive predictive value (PPV) of predictions three-fold [9].

The following workflow diagram illustrates this closed-loop experimental protocol:

[Diagram: Closed-Loop ISP Workflow. Pre-trained scFM → fine-tune model on disease and control data → open-loop ISP screen → cross-reference with differential expression → experimental validation (e.g., Perturb-seq) → fine-tune model with experimental data → closed-loop ISP final target list.]

Critical Appraisal of Model Performance and Simple Baselines

While scFMs hold immense promise, a critical appraisal of their performance against simpler models is essential for robust experimental design. Recent benchmarking studies reveal that the performance of complex deep-learning models for predicting perturbation effects is highly context-dependent.

Table 2: Model Performance Comparison for Perturbation Prediction

| Model / Baseline | Reported Performance | Context and Notes |
| --- | --- | --- |
| Closed-loop scFM (Geneformer) | 3x increase in Positive Predictive Value (PPV) vs. open-loop (from 3% to 9%); high NPV (99%), sensitivity (76%), specificity (81%) [9] | Applied to T-cell activation and RUNX1-FPD; performance improved with just ~20 perturbation examples [9] |
| Open-loop scFM (Geneformer) | PPV: 3%; Negative Predictive Value (NPV): 98%; sensitivity: 48%; specificity: 60% [9] | Outperformed differential expression (DE) analysis for NPV, sensitivity, and specificity [9] |
| 'Additive' baseline model | Lower prediction error (L2 distance) than 5 foundation models and 2 other deep learning models for predicting double perturbation effects [11] | Predicts double perturbation effects as the sum of individual logarithmic fold changes; used no double perturbation data for training [11] |
| 'No change' baseline model | Performance equivalent or superior to deep learning models in predicting genetic interactions from double perturbations [11] | Always predicts the same expression as in the control condition |
| Simple linear model | Outperformed or matched deep learning models in predicting effects of unseen single-gene perturbations [11] | Uses dimension-reducing embeddings of training data; performance can be enhanced with embeddings from foundation models [11] |

Experimental Protocol: Benchmarking scFM Performance

  • Objective: Compare the performance of a scFM against simple baselines for a specific prediction task.
  • Procedure:
    • Dataset Selection: Select a published single-cell perturbation dataset with both single and double genetic perturbations (e.g., Norman et al. or Replogle et al. datasets) [11].
    • Model Training: Fine-tune the scFM on a portion of the single perturbations and a subset of double perturbations.
    • Baseline Implementation: Implement the 'additive' and 'no change' baselines as described in Table 2.
    • Testing: Evaluate all models on a held-out test set of double perturbations.
    • Metric Calculation: Calculate performance metrics, including L2 distance between predicted and observed expression values for the top highly expressed genes, and the ability to predict genetic interactions (true-positive rate vs. false discovery proportion) [11].
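The two baselines in this benchmark are simple enough to state directly. A minimal sketch, assuming log-transformed expression vectors (all variable names are illustrative):

```python
# Minimal versions of the 'additive' and 'no change' baselines from Table 2,
# operating on mean expression vectors assumed to be log-transformed.
import numpy as np

def no_change_baseline(control_mean):
    """Predict the control profile for every double perturbation."""
    return control_mean

def additive_baseline(control_mean, single_a, single_b):
    """Predict a double perturbation as control + sum of single log-fold-changes."""
    lfc_a = single_a - control_mean
    lfc_b = single_b - control_mean
    return control_mean + lfc_a + lfc_b

def l2_error(pred, observed):
    return float(np.linalg.norm(pred - observed))

# Toy example with 4 genes.
control = np.array([1.0, 1.0, 1.0, 1.0])
a = np.array([2.0, 1.0, 1.0, 1.0])            # perturbation A up-regulates gene 0
b = np.array([1.0, 0.5, 1.0, 1.0])            # perturbation B down-regulates gene 1
observed_ab = np.array([2.0, 0.5, 1.0, 1.0])  # effects combine additively here

print(l2_error(additive_baseline(control, a, b), observed_ab))  # 0.0
print(l2_error(no_change_baseline(control), observed_ab))       # > 0
```

A fine-tuned scFM only demonstrates genuine value on this task when its L2 error beats both of these trivial predictors.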

The Scientist's Toolkit: Research Reagent Solutions

The following table details key reagents and computational tools essential for conducting in silico perturbation studies for rare diseases.

Table 3: Essential Research Reagents and Tools for In Silico Perturbation Modeling

| Item / Resource | Function / Application | Example / Note |
| --- | --- | --- |
| Pre-trained scFM | Provides a foundational understanding of gene-gene relationships from vast single-cell data; base for task-specific fine-tuning | Geneformer [9], scGPT [11], scFoundation [11] |
| Rare Disease Model scRNA-seq Data | Essential dataset for fine-tuning the scFM to recognize the specific disease pathophysiology | Patient-derived cells or genetically engineered in vitro models (e.g., RUNX1-knockout HSCs) [9] |
| Perturb-seq Data | Gold-standard experimental data used to "close the loop" and ground-truth model predictions, drastically improving accuracy | CRISPR-based perturbation coupled with scRNA-seq [9] |
| AI-Based Pathogenicity Predictor | Aids in initial variant prioritization and diagnosis (CoU1), helping to define the genetic basis of the rare disease | popEVE model scores variants by disease likelihood [42] |
| Network Analysis Platform | Identifies novel therapeutic targets and supports drug repurposing by analyzing interactions within biological systems | PandaOmics for ALS [41], STRING, Cytoscape [41] |
| Linear Model Baselines | Critical for benchmarking the performance of more complex deep learning models; ensures reported advances are meaningful | 'Additive' and 'No Change' models, simple linear regression with embeddings [11] |

Integrated Signaling Pathways Identified via In Silico Perturbation

Applying the closed-loop ISP framework to rare diseases like RUNX1-Familial Platelet Disorder (RUNX1-FPD) can identify key dysregulated signaling pathways. The model nominates specific genes within these pathways whose perturbation can shift the diseased state toward normal, highlighting them as potential therapeutic targets [9].

The diagram below summarizes the key signaling pathways and candidate therapeutic targets identified for RUNX1-FPD using this approach:

[Diagram: RUNX1-FPD pathways and candidate targets. RUNX1 loss-of-function connects to Protein Kinase C (PRKCB), mTOR signaling, the CD74-MIF signaling axis, and phosphoinositide 3-kinase (PI3K).]

Rare diseases and research involving challenging primary patient samples present a major obstacle in biomedical research: the profound scarcity of biological material. This scarcity limits the application of traditional high-throughput screening methods for target discovery and therapeutic development. The emergence of in silico perturbation modeling, particularly using single-cell Foundation Models (scFMs), provides a powerful framework to overcome these limitations. These technologies enable the virtual simulation of cellular and molecular responses to genetic or chemical perturbations, dramatically reducing the experimental burden on precious samples [44]. This Application Note details protocols for employing these computational strategies to conduct virtual screens and derive biologically meaningful insights from limited datasets, thereby accelerating research for rare conditions and complex diseases.

Computational Frameworks for In Silico Perturbation

Several advanced computational frameworks now enable the prediction of cellular responses to perturbations. The choice of model depends on the type of available data and the specific biological question. The core capability of these models is to learn the underlying "rules" of cellular biology from large-scale existing data and apply them to a specific, data-scarce context of interest.

The Large Perturbation Model (LPM)

The Large Perturbation Model (LPM) is a deep-learning architecture specifically designed to integrate heterogeneous perturbation experiments. Its key innovation is the disentanglement of the Perturbation (P), Readout (R), and experimental Context (C) into separate dimensions [45].

  • Architecture and Workflow: LPM is trained to predict the outcome of a perturbation experiment based on a symbolic representation of the (P, R, C) tuple. It uses a decoder-only architecture that does not explicitly encode observations, allowing it to learn perturbation-response rules that are disentangled from the specific context in which they were initially observed [45].
  • Application to Rare Diseases: Once trained on a pooled dataset encompassing diverse perturbations, readouts, and cell types, an LPM instance can be queried to predict the effects of unseen perturbations—including drugs—on a rare disease cell type, even if that specific cell type was not heavily represented in the training data. This allows for the virtual screening of compound libraries against challenging cell types.
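The disentanglement described above can be sketched schematically: separate embedding tables for perturbations, readouts, and contexts, combined by a decoder so that any (P, R, C) tuple can be queried, including combinations never observed together. This is a toy illustration with a random, untrained projection, not the actual LPM architecture:

```python
# Schematic of the PRC-disentangled idea (not the actual LPM implementation):
# each perturbation, readout, and context gets its own embedding, and any
# (P, R, C) tuple -- including combinations absent from training -- can be
# scored from those shared embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 16

perturbations = {"drug_X": rng.normal(size=dim), "KO_GENE1": rng.normal(size=dim)}
readouts = {"viability": rng.normal(size=dim)}
contexts = {"rare_disease_cells": rng.normal(size=dim), "HEK293": rng.normal(size=dim)}

# Stand-in decoder: a fixed random projection of the concatenated embeddings
# (in a trained model this would be a learned decoder network).
W = rng.normal(size=(3 * dim,)) / np.sqrt(3 * dim)

def predict(p, r, c):
    z = np.concatenate([perturbations[p], readouts[r], contexts[c]])
    return float(W @ z)

# Query a combination not seen during training: a drug's effect on
# viability in the rare-disease context.
print(predict("drug_X", "viability", "rare_disease_cells"))
```

Because the perturbation embedding is shared across contexts, a rule learned for drug_X in a common cell line transfers to the rare-disease context through the context embedding alone.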

The following diagram illustrates the core architecture and workflow of an LPM for in silico discovery.

[Diagram: LPM workflow. Inputs — Perturbation (P, e.g., drug or CRISPR), Readout (R, e.g., transcriptomics or viability), and Context (C, e.g., cell type or tissue) — feed the PRC-disentangled Large Perturbation Model, which outputs a predicted perturbation outcome and mechanistic insights (e.g., mechanism of action, gene networks).]

Image-Based Perturbation Prediction with IMPA

For research involving high-content imaging, the IMage Perturbation Autoencoder (IMPA) offers a solution for predicting morphological responses. IMPA is a generative style-transfer model that decomposes a cell image into a content component (the cell's identity) and a style component (the perturbation effect) [46].

  • Protocol: Predicting Morphological Drug Responses
    • Input: Acquire high-content microscopy images (e.g., from Cell Painting assays) of untreated cells from your scarce sample.
    • Model Application: Use IMPA to transfer the "style" of a desired drug perturbation onto your control cell images, generating in-silico-treated images.
    • Analysis: Extract morphological features from the generated images to predict the phenotypic impact of the drug, enabling prioritization for subsequent physical testing [46].

Community Benchmarks and Virtual Cell Models

The field is moving towards standardized benchmarking to accelerate development. Initiatives like the Arc Institute's Virtual Cell Challenge provide community-wide competitions to stress-test models on their ability to generalize to new cell contexts and predict the effects of single gene perturbations [47]. Furthermore, the concept of a "Virtual Cell" extends beyond prediction to include the explanation of underlying mechanisms and the discovery of novel biology, forming a Predict-Explain-Discover (P-E-D) cycle that is highly valuable for drug discovery [48].

Table 1: Comparison of In Silico Perturbation Modeling Frameworks

| Framework | Core Architecture | Input Data Modality | Primary Output | Key Advantage for Sample Scarcity |
| --- | --- | --- | --- | --- |
| Large Perturbation Model (LPM) [45] | PRC-disentangled, decoder-only deep learning | Transcriptomics, viability | Predicted post-perturbation readout (e.g., gene expression) | Integrates data from diverse contexts; predicts for unseen perturbations |
| IMPA [46] | Conditional Generative Adversarial Network (GAN) | High-content microscopy images | Synthetic image of perturbed cell | Predicts morphological effects without needing paired before/after image data |
| scGPT / Geneformer [45] | Transformer-based encoder | Single-cell transcriptomics | Cell and gene embeddings | Can be fine-tuned on small datasets for context-specific predictions |
| VirtuDockDL [49] | Graph Neural Network (GNN) | Chemical structures (SMILES) | Predicted binding affinity / activity | Accelerates virtual screening of compound libraries against a protein target |

Protocols for Target Discovery and Virtual Screening

Protocol 1: Machine Learning-Driven Virtual Screening for Novel Inhibitors

This protocol is designed to identify potential drug candidates for a target of interest (e.g., a protein implicated in a rare disease) using a machine learning (ML)-based classifier, minimizing the need for wet-lab screening until the final stages [50] [51].

  • Step 1: Data Curation and Preprocessing
    • Active Compounds: Curate a set of known active inhibitors for your target from public databases like ChEMBL or BindingDB [50] [51]. Label these compounds as 1.
    • Decoy Compounds: Generate a set of physicochemically similar but presumed inactive molecules (decoys) from resources like the Directory of Useful Decoys-Enhanced (DUD-E) [50]. Label these as 0. To mitigate bias, consider alternative decoy strategies such as using Dark Chemical Matter (DCM) or random selections from the ZINC15 database [52].
  • Step 2: Feature Engineering and Model Training
    • Compute Molecular Descriptors: Use cheminformatics toolkits like RDKit to compute 2D molecular descriptors (e.g., molecular weight, LogP, topological polar surface area) from the SMILES strings of all compounds [50] [51].
    • Train ML Classifiers: Train multiple ML models (e.g., Random Forest (RF), Support Vector Machine) on the labeled dataset. Evaluate models using tenfold cross-validation and metrics like accuracy, specificity, and the area under the curve (AUC). The Random Forest model often outperforms others in this task, achieving accuracies >93% in published studies [50] [51].
  • Step 3: Virtual Screening and Hit Identification
    • Screen Compound Library: Apply the trained and validated model to screen a large library of uncharacterized compounds (e.g., a virtual library of phytochemicals or drug-like molecules). The model will predict the probability of each compound being active.
    • Prioritize Hits: Select the top-ranking compounds predicted to be active. Filter these hits based on drug-likeness (e.g., Lipinski's Rule of Five) and other desired properties [50].
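Steps 2 and 3 can be sketched with scikit-learn. Synthetic descriptor vectors stand in for RDKit output here (in a real screen each row would be 2D descriptors computed from a SMILES string), and all names and dataset sizes are illustrative:

```python
# Sketch of Steps 2-3: train an active-vs-decoy classifier, evaluate it with
# tenfold cross-validated AUC, then rank an uncharacterized library.
# Synthetic Gaussian "descriptors" stand in for RDKit-computed ones.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_desc = 10

# Label 1 = known active, 0 = decoy; actives are shifted in descriptor space.
actives = rng.normal(1.0, 1.0, size=(100, n_desc))
decoys = rng.normal(0.0, 1.0, size=(100, n_desc))
X = np.vstack([actives, decoys])
y = np.array([1] * 100 + [0] * 100)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
print(f"10-fold CV AUC: {auc:.2f}")

# Virtual screen: rank an uncharacterized library by predicted activity.
clf.fit(X, y)
library = rng.normal(0.5, 1.0, size=(500, n_desc))
probs = clf.predict_proba(library)[:, 1]
hits = np.argsort(probs)[::-1][:10]          # top-10 prioritized compounds
```

The top-ranked hits would then be filtered for drug-likeness (e.g., Lipinski's Rule of Five) before any wet-lab commitment.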

The workflow for this integrated computational screening process is summarized below.

[Diagram: ML screening workflow. Define protein target → collect known actives (ChEMBL, BindingDB) and decoy compounds (DUD-E, ZINC15) → feature engineering (RDKit descriptors) → train and validate ML model (e.g., Random Forest) → virtually screen large compound library → prioritized hit list for experimental validation.]

Protocol 2: Leveraging LPM for Compound Repurposing

This protocol uses a pre-trained LPM to identify existing drugs that might be effective against a rare disease cell type.

  • Step 1: Model Querying
    • Define the perturbation (P) as a library of approved drugs or a specific compound of interest.
    • Define the readout (R) as a transcriptomic profile or cell viability measurement.
    • Define the context (C) as the specific challenging cell type or patient-derived sample.
  • Step 2: Analysis of Predictions
    • The LPM will output a predicted post-perturbation state for each (P, R, C) tuple.
    • Analyze the predictions to identify compounds that drive the gene expression profile of the diseased cells towards a healthier state or that significantly reduce predicted cell viability in cancer contexts.
    • The LPM's latent space can be used to identify shared mechanisms of action between genetic and chemical perturbations, providing explanatory power for the predictions [45].

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Computational Tools and Databases

| Tool / Database | Type | Primary Function in Protocol | Reference/Access |
| --- | --- | --- | --- |
| RDKit | Cheminformatics Software | Computes molecular descriptors and fingerprints from SMILES strings for ML model training | [49] [50] |
| ChEMBL / BindingDB | Bioactivity Database | Source of known active compounds for a target; used to create labeled training data for ML models | [50] [51] |
| Directory of Useful Decoys-Enhanced (DUD-E) | Decoy Compound Database | Provides physicochemically matched, presumed inactive molecules to balance ML training sets | [50] [52] |
| ZINC15 | Commercial Compound Database | Source of purchasable, drug-like molecules for virtual screening and decoy selection | [52] |
| scikit-learn | Machine Learning Library | Provides implementations of Random Forest, SVM, and other algorithms for building classifiers | [50] [51] |
| PADIF Fingerprint | Protein-Ligand Interaction Descriptor | Used to train target-specific ML scoring functions to improve virtual screening power | [52] |
| Arc Virtual Cell Atlas | Transcriptomics Data Repository | Large-scale single-cell dataset for pre-training or fine-tuning perturbation models | [47] |

Validation and Downstream Workflow

While in silico models significantly de-risk experimentation, their predictions require rigorous validation before conclusions are drawn.

  • In Silico Validation: For ML-based virtual screening, use tenfold cross-validation and metrics like Area Under the Curve (AUC) to assess model performance on held-out test data. A high AUC (>0.98 has been reported with RF models) indicates strong screening power [50].
  • Experimental Validation: Top-ranked hits from virtual screens must be validated experimentally.
    • In Vitro Assays: Use cell viability assays (e.g., for oncology targets) or enzymatic activity assays to confirm biological activity on the rare cell type, using the limited available samples in a highly targeted manner.
    • Binding Confirmation: Employ techniques like Surface Plasmon Resonance (SPR) or conduct molecular docking and molecular dynamics (MD) simulations to study the stability of the ligand-protein complex and confirm binding affinity predictions [49] [50].

Navigating Challenges and Limitations in scFM-Based Perturbation Modeling

Addressing the Non-Sequential Nature of Gene Expression Data

A fundamental challenge in applying artificial intelligence to single-cell genomics lies in the non-sequential nature of gene expression data. Unlike natural language, where words follow a deterministic order, or images, where pixels have spatial relationships, the genes within a cell's transcriptome have no inherent sequence. This creates a significant obstacle for transformer-based architectures and other sequential models that require structured input. This Application Note outlines standardized protocols for processing, tokenizing, and analyzing non-sequential gene expression data within in silico perturbation modeling frameworks using single-cell Foundation Models (scFMs). The methodologies described herein enable researchers to transform unordered gene vectors into structured inputs suitable for advanced AI models, thereby facilitating more accurate predictions of cellular responses to genetic and chemical perturbations.

Table: Core Challenges of Non-Sequential Gene Expression Data

| Challenge | Description | Impact on Modeling |
| --- | --- | --- |
| Lack of Native Ordering | Genes in expression vectors have no biological sequence | Direct application of sequential models (e.g., transformers) is invalid |
| Dimensionality | Profiling typically measures 20,000+ genes per cell | Computationally intensive; requires robust feature selection |
| Batch Effects | Technical variations between experiments | Introduces spurious correlations; hinders model generalization |

Tokenization Strategies for Unordered Genomic Data

Tokenization is the critical process of converting raw gene expression data into discrete units (tokens) that scFMs can process. Since genes lack a natural sequence, a deterministic ordering must be imposed. The following protocol details the primary strategies identified in the literature for this purpose [1] [5].

Protocol: Deterministic Gene Ranking by Expression

This protocol creates a cell-specific sequence by ranking genes based on their expression magnitude, which serves as a consistent and biologically informative ordering system.

  • Input: Raw or normalized count matrix (Cells × Genes).
  • Quality Control: Filter out low-abundance genes and low-quality cells using standard scRNA-seq preprocessing tools (e.g., Scanpy).
  • Gene Ranking:
    • For each cell, sort all detected genes in descending order of their expression value.
    • The most highly expressed gene is assigned the first token position, the second most expressed the next, and so on.
  • Sequence Truncation/Padding:
    • Define a fixed sequence length L (e.g., 1200 genes) based on model requirements and computational constraints.
    • For cells with more than L detected genes, retain only the top L ranked genes.
    • For cells with fewer than L detected genes, pad the sequence with a special <PAD> token or mask.
  • Token Embedding: Each gene in its ranked position is converted into a token. The token embedding typically combines:
    • A gene identity embedding (a unique vector for each gene, learned or pre-trained).
    • The expression value (or a binned representation of it).
    • A positional encoding to inform the model of the gene's assigned rank in the sequence.
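The ranking, truncation, and padding steps can be sketched for a single cell as follows. Gene identifiers, pad token, and sequence length L are illustrative; production scFMs additionally attach learned gene-identity and positional embeddings to each token:

```python
# Minimal rank-value tokenization for one cell: sort detected genes by
# descending expression, then truncate or pad to a fixed length L.
import numpy as np

def tokenize_cell(expr, gene_ids, L=6, pad_id=-1):
    """Return (token_ids, values) for one cell's expression vector."""
    detected = np.flatnonzero(expr > 0)                     # drop zero counts
    order = detected[np.argsort(-expr[detected], kind="stable")]
    token_ids = [gene_ids[i] for i in order[:L]]            # truncate to L
    values = [float(expr[i]) for i in order[:L]]
    while len(token_ids) < L:                               # pad short cells
        token_ids.append(pad_id)
        values.append(0.0)
    return token_ids, values

gene_ids = [10, 11, 12, 13, 14, 15, 16, 17]
expr = np.array([0.0, 5.0, 2.0, 0.0, 9.0, 1.0, 0.0, 0.0])
tokens, vals = tokenize_cell(expr, gene_ids, L=6)
print(tokens)  # [14, 11, 12, 15, -1, -1]
print(vals)    # [9.0, 5.0, 2.0, 1.0, 0.0, 0.0]
```

Note the stable sort: ties in expression resolve to a deterministic order, which keeps the imposed sequence reproducible across runs.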
Alternative Tokenization Methods
  • Expression Binning: Genes are partitioned into bins based on their expression values (e.g., high, medium, low), and the sequence is formed based on bin membership [1] [53].
  • Fixed Gene Ordering: A universal order is established based on external information, such as genomic coordinates or a pre-defined list, and is applied to all cells. This is less common due to its biological arbitrariness.

The following diagram illustrates the workflow for the gene ranking tokenization strategy.

[Diagram: Gene-ranking tokenization workflow. Raw expression matrix (cells × genes) → per-cell gene ranking by expression value → sequence truncation/padding to fixed length L → token and embedding generation → structured input for the scFM.]

Experimental Framework for Model Benchmarking

To evaluate the efficacy of different tokenization and modeling approaches in handling non-sequential data for perturbation tasks, a robust benchmarking framework is essential. The following protocol utilizes the BioLLM framework to ensure standardized and reproducible comparisons [5].

Protocol: Benchmarking scFMs with BioLLM

This protocol outlines the steps for performing a comparative analysis of different scFMs on a standardized perturbation dataset.

  • Environment and Data Setup

    • Install the BioLLM framework, which provides a unified interface for models like scGPT, Geneformer, and scBERT.
    • Load a canonical perturbation dataset (e.g., Kang 2018 dataset of IFN-β stimulated PBMCs) using BioLLM's data loader [54].
    • Standardize the dataset by renaming observation keys (e.g., adata.obs['label'] to adata.obs['condition']).
  • Model Initialization and Configuration

    • Initialize multiple scFMs (e.g., scGPT, Geneformer, scFoundation, scBERT) through the BioLLM's foundation model loader.
    • Configure a random_forest_classifier or similar estimator as the perturbation predictor for tasks like cell type prioritization.
  • Feature Selection and Training

    • Employ the model's native tokenization strategy (e.g., gene ranking for scGPT).
    • For baseline comparison, use alternative feature selection methods:
      • select_variance_feature=True: Uses the original Augur variance-based selection.
      • scanpy.pp.highly_variable_genes: Uses Scanpy's method for faster, potentially inflated performance.
    • Train all models with identical training/validation/test splits (e.g., 70/20/10) and fixed hyperparameters.
  • Performance Evaluation

    • Extract cell embeddings from each model in a zero-shot setting.
    • Calculate the Average Silhouette Width (ASW) to assess the biological relevance and batch-effect removal capacity of the embeddings.
    • For perturbation prediction, calculate the Area Under the Curve (AUC) to measure how well a model can predict the treatment condition (e.g., stimulated vs. control) for each cell type.
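The two evaluation metrics can be sketched on synthetic data (the embeddings and per-cell scores below are stand-ins for real model outputs):

```python
# Sketch of the evaluation step: ASW over cell-type labels for embedding
# quality, and AUC for predicting stimulated vs. control cells.
import numpy as np
from sklearn.metrics import silhouette_score, roc_auc_score

rng = np.random.default_rng(1)

# Two well-separated "cell types" in a 2-D embedding space.
emb = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
asw = silhouette_score(emb, labels)            # near 1 for clean separation

# Perturbation prediction: AUC of per-cell scores for stimulated vs. control.
cond = rng.integers(0, 2, size=200)            # 1 = stimulated
cell_scores = cond + rng.normal(0, 0.5, size=200)   # noisy but informative
auc = roc_auc_score(cond, cell_scores)

print(f"ASW: {asw:.2f}, AUC: {auc:.2f}")
```

In the BioLLM setting the embeddings come from each scFM in zero-shot mode and the scores from the configured classifier, but the metric calls are identical.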

Table: Benchmarking Results of scFMs on Exemplar Tasks (Adapted from BioLLM [5])

| Model | Zero-shot Embedding Quality (ASW) | Perturbation Prediction (AUC) | Computational Efficiency |
| --- | --- | --- | --- |
| scGPT | 0.85 (Consistently highest) | 0.92 | High (Optimized memory/time) |
| Geneformer | 0.78 (Strong on gene-level tasks) | 0.87 | High |
| scFoundation | 0.75 | 0.84 | Moderate |
| scBERT | 0.65 (Lags behind peers) | 0.79 | Low |

Application to In Silico Perturbation Modeling

Once a model has processed gene expression data into a structured format, it can be powerfully applied to predict the effects of perturbations. The following protocols detail this for two key tasks.

Protocol: Predicting Cell-Type Specific Perturbation Responses with Augur

This protocol uses the Augur method to identify which cell types within a heterogeneous sample are most affected by a perturbation, based on the separability of their transcriptomic profiles [54].

  • Input Preparation: Use a standardized single-cell dataset with perturbation labels (e.g., control and stimulated) and cell type annotations.
  • Model Loading: Initialize an Augur object in Pertpy with a specified classifier (e.g., ag_rfc = pt.tl.Augur("random_forest_classifier")).
  • Data Formatting: Load the AnnData object into Augur, specifying the label_col (perturbation condition) and cell_type_col.
  • Training and Prediction: Run the predict function with parameters like subsample_size=20 (number of cells per type) and n_threads=4 for parallelization. Use select_variance_features=True for high-resolution results.
  • Interpretation: The output is a prioritization of cell types by mean_augur_score (derived from AUC). Higher scores indicate cell types whose transcriptional state is more profoundly altered by the perturbation.
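The core Augur idea, separability of perturbed vs. control cells within each cell type, can be sketched without Pertpy as a cross-validated per-type classifier AUC. This simplified stand-in omits Augur's subsampling and variance-based feature selection, so real analyses should use the Pertpy implementation above:

```python
# Simplified stand-in for the Augur idea: per cell type, train a classifier
# to separate control from stimulated cells; a higher cross-validated AUC
# marks the cell type as more strongly affected by the perturbation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n_genes = 20

def cell_type_score(responds):
    """Cross-validated AUC for one synthetic cell type; `responds` controls
    whether the stimulation actually shifts this type's expression."""
    control = rng.normal(0, 1, (80, n_genes))
    stim = rng.normal(0, 1, (80, n_genes))
    if responds:
        stim[:, :5] += 2.0                     # strong response in 5 genes
    X = np.vstack([control, stim])
    y = np.array([0] * 80 + [1] * 80)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()

scores = {"T_cell": cell_type_score(True), "B_cell": cell_type_score(False)}
print(scores)   # the responding type scores markedly higher
```

An unaffected type hovers near the chance AUC of 0.5, which is why Augur reports scores rather than raw accuracy.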
Protocol: Counterfactual Prediction with GPerturb

GPerturb is a Gaussian process-based model that estimates sparse, interpretable gene-level perturbation effects, providing uncertainty estimates for its predictions [55].

  • Data Preprocessing: Format your single-cell perturbation data (e.g., from Perturb-seq). GPerturb can handle both raw counts (GPerturb-ZIP) and continuous transformed data (GPerturb-Gaussian).
  • Model Training: GPerturb uses a supervised learning approach to disentangle basal expression (cell type-specific) from additive perturbation effects. The model is trained on a subset of data (e.g., 80% of cells).
  • Prediction and Analysis:
    • Input a target cell state and a perturbation of interest.
    • The model generates a counterfactual prediction—the gene expression profile of the target cell had it been subjected to the perturbation.
    • Analyze the output coefficients to identify genes most strongly affected by the perturbation. The model's sparsity constraint ensures only a subset of genes will have non-zero effect sizes.
    • Utilize the built-in uncertainty estimates to gauge the confidence of each predicted effect.
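The core decomposition GPerturb performs — basal expression plus a sparse additive perturbation effect — can be sketched in a few lines. This toy version (not the Gaussian-process model, and with a hard threshold standing in for the sparsity prior) uses fabricated 4-gene data:

```python
import random
from statistics import mean

def additive_effects(control, perturbed, threshold=0.5):
    """Estimate per-gene additive perturbation effects as the shift from the
    basal (control) mean, zeroing small effects as a crude stand-in for a
    sparsity prior. control/perturbed: lists of per-cell expression vectors."""
    n_genes = len(control[0])
    basal = [mean(cell[g] for cell in control) for g in range(n_genes)]
    shifted = [mean(cell[g] for cell in perturbed) for g in range(n_genes)]
    effects = [s - b for s, b in zip(shifted, basal)]
    return basal, [e if abs(e) > threshold else 0.0 for e in effects]

def counterfactual(cell, effects):
    """Predicted expression of `cell` had it received the perturbation."""
    return [x + e for x, e in zip(cell, effects)]

random.seed(1)
true_effect = [2.0, 0.0, -1.5, 0.0]  # only genes 0 and 2 truly respond
control = [[random.gauss(5, 0.3) for _ in range(4)] for _ in range(200)]
perturbed = [[random.gauss(5 + t, 0.3) for t in true_effect] for _ in range(200)]
basal, eff = additive_effects(control, perturbed)
print([round(e, 1) for e in eff])  # gene 0 up ~2, gene 2 down ~1.5, rest zeroed
```

GPerturb itself would also return posterior uncertainty for each effect; this sketch only recovers the point estimates.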

Table: Key Computational Tools for scFM and Perturbation Modeling

| Tool Name | Type | Primary Function in Perturbation Modeling | Reference/Source |
| --- | --- | --- | --- |
| BioLLM | Software Framework | Unified interface for integrating and benchmarking multiple scFMs. | [5] |
| Pertpy | Python Toolkit | Provides perturbation analysis methods, including Augur. | [54] |
| scGPT | Foundation Model | Transformer-based scFM for cell and gene embedding; excels in multiple tasks. | [1] [5] |
| GPerturb | Perturbation Model | Gaussian process model for sparse, interpretable effect estimation with uncertainty. | [55] |
| CPA | Perturbation Model | Autoencoder to predict counterfactual expression under different perturbations. | [55] |
| CZ CELLxGENE | Data Catalog | Platform providing access to millions of curated single-cell datasets for pretraining. | [1] |

Effectively addressing the non-sequential nature of gene expression data is a cornerstone of modern computational biology. The tokenization strategies, benchmarking protocols, and specialized perturbation models detailed in this Application Note provide a robust and standardized pathway for researchers to leverage the full power of single-cell Foundation Models. By transforming unordered transcriptomic data into a structured format that AI models can interpret, we unlock the potential to perform high-fidelity in silico simulations of genetic and chemical perturbations. This capability is poised to dramatically accelerate therapeutic discovery and deepen our understanding of fundamental cellular processes.

In the field of in silico perturbation modeling with single-cell Foundation Models (scFMs), data quality is not merely a technical concern but a fundamental determinant of model reliability and biological insight. Batch effects, technical noise, and data inconsistencies represent significant challenges that can compromise the integrity of computational predictions. Batch effects are defined as unwanted technical variations introduced due to differences in laboratory conditions, instrumentation, reagent lots, or personnel [56] [57]. In the context of perturbation modeling, where the goal is to understand causal relationships by predicting system responses to interventions, these artifacts can create false predictions or obscure true biological signals [4] [56].

The integration of diverse, large-scale perturbation datasets is central to training robust large perturbation models (LPMs) and scFMs. These models learn to disentangle perturbation (P), readout (R), and context (C) dimensions to predict experimental outcomes [4]. However, this integration is critically dependent on data harmonization. Technical variations can severely hinder the model's ability to learn generalizable rules, leading to inaccurate predictions of post-perturbation cellular states and misidentification of molecular mechanisms [4] [56]. Therefore, implementing rigorous protocols for assessing and mitigating batch effects is a prerequisite for biologically meaningful in silico discovery.

Understanding Batch Effects and Technical Noise in Single-Cell Data

Technical noise in single-cell and spatial transcriptomics arises from multiple sources throughout the experimental workflow. Major sources include variability in sample preparation protocols, differences in sequencing platforms and library preparation kits, reagent batch variations, and environmental conditions [56] [58]. In mass-spectrometry-based proteomics, the problem is further compounded by the multi-step data transformation process from spectra to peptides to proteins, creating multiple potential entry points for batch effects [59] [57].

The consequences of unaddressed batch effects are profound. They can:

  • Generate false positives: Technical variation can be misinterpreted as biological signal, leading to erroneous identification of differentially expressed genes or proteins [56] [58].
  • Mask true biological signals: Strong batch effects can obscure genuine biological differences, reducing statistical power and leading to false negatives [56].
  • Compromise reproducibility: Batch effects are a paramount factor contributing to the reproducibility crisis in biomedical research, potentially resulting in retracted articles and invalidated findings [56].
  • Mislead predictive models: For in silico perturbation models, batch effects can result in inaccurate predictions of cellular responses to genetic or chemical perturbations [4].

Special Considerations for Perturbation Modeling

The recently developed Large Perturbation Model (LPM) architecture demonstrates the critical importance of high-quality, harmonized data. LPMs integrate heterogeneous perturbation experiments by representing perturbation, readout, and context as disentangled dimensions [4]. This approach enables predicting outcomes of unobserved perturbation experiments and identifying shared molecular mechanisms across perturbation types. However, the model's performance is contingent on its ability to learn perturbation-response rules that are generalizable across contexts, a task severely hampered by unaddressed batch effects [4].

Assessment and Diagnostic Protocols for Data Quality Issues

Visual Diagnostic Methods

Effective batch effect correction begins with comprehensive assessment. Visual methods provide an intuitive first look at data structure and potential technical artifacts:

  • Principal Component Analysis (PCA): Plot samples in reduced dimensions, coloring by batch. Batch effects are evident when samples cluster strongly by technical factors rather than biological conditions [58].
  • UMAP Visualization: Particularly useful for single-cell data, UMAP can reveal batch-driven clustering patterns that may obscure biological heterogeneity [58].
  • Side-by-side Boxplots: Visualize correlations of batch effects with principal components or investigate feature-wise distributions across batches [57].

Quantitative Metrics for Batch Effect Assessment

Beyond visual inspection, quantitative metrics provide objective measures of batch effect severity and correction efficacy:

Table 1: Quantitative Metrics for Assessing Batch Effects

| Metric | Description | Interpretation | Ideal Value |
| --- | --- | --- | --- |
| Average Silhouette Width (ASW) | Measures clustering tightness and separation | Higher values indicate better batch mixing while preserving cell type identity | High for cell type, low for batch |
| Adjusted Rand Index (ARI) | Measures similarity between two clusterings | Higher values indicate better preservation of biological clusters after correction | Close to 1 |
| Local Inverse Simpson's Index (LISI) | Quantifies diversity of batches in local neighborhoods | Higher values indicate better batch mixing | High |
| kBET Acceptance Rate | Tests whether batch labels are random in local neighborhoods | Higher rates indicate successful batch mixing | Close to 1 |

These metrics should be applied both before and after correction to quantitatively evaluate the effectiveness of the chosen batch effect correction strategy [58].

Experimental Protocols for Batch Effect Mitigation

Strategic Experimental Design

The most effective approach to batch effects is prevention through careful experimental design:

  • Sample Randomization: Distribute biological conditions and groups evenly across processing batches to avoid confounding [58].
  • Balanced Processing: Process all experimental groups simultaneously whenever possible, using consistent reagents and protocols [56].
  • Reference Materials: Incorporate universal reference samples (such as the Quartet protein reference materials) in each batch to facilitate technical variation assessment [59] [56].
  • Replication Strategy: Include at least two replicates per group per batch to enable robust statistical modeling of batch effects [58].

Computational Correction Workflows

When batch effects cannot be prevented through design alone, computational correction is necessary. The workflow differs for various data types and analytical goals:

[Diagram: Raw Data → Assessment & Diagnostics → Choose Strategy Level → Precursor-, Peptide-, or Protein-Level Correction → Validation → In Silico Modeling]

Diagram 1: Batch Effect Correction Workflow - This diagram outlines the decision process for implementing batch effect correction in omics data analysis, from initial assessment through validation for in silico modeling.

Proteomics Data Correction Protocol

For mass-spectrometry-based proteomics data, recent benchmarking studies using the Quartet protein reference materials provide clear guidance:

  • Data Level Selection: Protein-level correction consistently demonstrates superior robustness compared to precursor or peptide-level correction across multiple quantification methods and batch effect correction algorithms [59].

  • Algorithm Selection: Test multiple algorithms as performance varies by context:

    • ComBat: Empirical Bayes framework effective for known batch factors [59] [58]
    • Ratio-based Methods: Intensity ratios relative to reference samples, particularly effective with MaxLFQ quantification [59]
    • RUV-III-C: Linear regression model to estimate and remove unwanted variation [59]
  • Performance Validation: Assess correction using:

    • Coefficient of variation (CV) within technical replicates across batches
    • Signal-to-noise ratio (SNR) in differentiating biological groups
    • Principal variance component analysis (PVCA) to quantify biological vs. technical variance contributions [59]
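The ratio-based approach in particular admits a compact sketch. The code below is an illustrative toy (not a proteomics pipeline), assuming a hypothetical dataset in which designated reference samples — standing in for Quartet-style materials — are measured in every batch; expressing each feature as a ratio to the batch's reference mean cancels batch-wide intensity shifts.

```python
from statistics import mean

def ratio_correct(samples, batches, reference_ids):
    """Sketch of ratio-based batch correction: express each feature as a ratio
    to the mean of that batch's reference samples.
    samples: {id: [feature values]}, batches: {id: batch}, reference_ids: set."""
    n = len(next(iter(samples.values())))
    ref_mean = {}
    for b in set(batches.values()):
        refs = [samples[s] for s in samples if batches[s] == b and s in reference_ids]
        ref_mean[b] = [mean(r[g] for r in refs) for g in range(n)]
    return {s: [v / ref_mean[batches[s]][g] for g, v in enumerate(vals)]
            for s, vals in samples.items() if s not in reference_ids}

# Batch 2 carries a 2x intensity shift that the shared reference cancels out.
samples = {"ref1": [10.0, 20.0], "a": [12.0, 18.0],
           "ref2": [20.0, 40.0], "b": [24.0, 36.0]}
batches = {"ref1": 1, "a": 1, "ref2": 2, "b": 2}
corrected = ratio_correct(samples, batches, {"ref1", "ref2"})
print(corrected)  # {'a': [1.2, 0.9], 'b': [1.2, 0.9]}
```

After correction, samples "a" and "b" — biologically identical but measured in different batches — have identical profiles.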

Table 2: Batch Effect Correction Algorithms and Their Applications

| Algorithm | Mechanism | Best For | Considerations |
| --- | --- | --- | --- |
| ComBat | Empirical Bayes adjustment for known batches | Bulk data with defined batch structure | Requires known batch info; may not handle nonlinear effects [59] [58] |
| SVA | Estimates hidden sources of variation | When batch variables are unknown | Risk of removing biological signal; requires careful modeling [58] [57] |
| Harmony | Iterative clustering in reduced dimension space | Single-cell data integration | Preserves biological variation while aligning batches [59] [58] |
| Ratio-based | Sample intensity relative to reference standards | Multi-batch proteomics studies | Requires universal reference materials [59] |
| WaveICA2.0 | Multi-scale decomposition with injection order | MS data with signal drift over time | Addresses continuous drift effects in large sample sets [59] [57] |

Transcriptomics Data Correction Protocol

For single-cell RNA-seq data in perturbation studies:

  • Preprocessing: Quality control to remove low-quality cells (high mitochondrial percentage, low gene counts) [60]

  • Integration: Use Harmony or similar algorithms to align cells across batches while preserving biological heterogeneity [60] [58]

  • Validation: Apply both visual (UMAP) and quantitative (LISI, ASW) metrics to ensure batch mixing without biological signal loss [58]

Integration with In Silico Perturbation Modeling

Data Quality Requirements for scFMs and LPMs

Single-cell Foundation Models and Large Perturbation Models have specific data quality requirements that must be addressed through proper batch correction:

  • Disentangled Representations: LPMs explicitly disentangle perturbation, readout, and context dimensions. Batch effects can blur these distinctions, reducing model accuracy in predicting outcomes for unseen perturbations [4].

  • Cross-Platform Generalization: Effective perturbation models must generalize across experimental contexts. Batch effects that correlate with platform-specific factors hinder this capability [4].

  • Mechanistic Insight: Quality-controlled data enables LPMs to accurately associate genetic and chemical perturbations that share molecular mechanisms, as demonstrated by the clustering of mTOR inhibitors with genetic perturbations targeting MTOR in the learned embedding space [4].

Quality Assurance Protocol for Modeling Pipelines

Implement these quality control checkpoints before training or fine-tuning perturbation models:

  • Pre-Training Check: Verify that negative control samples (unperturbed) cluster by biological identity rather than batch origin
  • Perturbation Specificity Test: Ensure that perturbation effects are larger than technical variations between replicates
  • Cross-Validation: Assess model performance across different experimental batches to detect residual batch effects
  • Benchmarking: Compare predictions against held-out experimental data or gold-standard reference datasets
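The first checkpoint can be automated with a silhouette comparison. The sketch below uses toy 2D embeddings (not real expression data) and flags success when the cell-type silhouette is high while the batch silhouette is near zero:

```python
import random
from math import dist
from statistics import mean

def silhouette(points, labels):
    """Mean silhouette width: >0 means points sit closer to their own label's
    cluster than to the nearest other cluster; ~0 means labels are mixed."""
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    def avg_d(p, members):
        ds = [dist(p, q) for q in members if q is not p]
        return mean(ds) if ds else 0.0
    widths = []
    for p, l in zip(points, labels):
        a = avg_d(p, groups[l])
        b = min(avg_d(p, g) for k, g in groups.items() if k != l)
        widths.append((b - a) / max(a, b))
    return mean(widths)

random.seed(3)
cell_type = [i % 2 for i in range(80)]
batch = [i % 4 // 2 for i in range(80)]  # orthogonal to cell type
# Unperturbed controls whose embedding is driven by biology, not batch.
cells = [(random.gauss(4 * t, 0.5), random.gauss(0, 0.5)) for t in cell_type]
print(round(silhouette(cells, cell_type), 2))  # high: biology drives structure
print(round(silhouette(cells, batch), 2))      # near 0: batch origin is mixed
```

A high batch silhouette on control cells would instead signal residual batch effects and block model training.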

[Diagram: Batch-Corrected Multi-Omics Data → LPM Training (Disentangled P, R, C) → Biological Discovery Tasks (Prediction of Unseen Perturbation Outcomes; Mechanism of Action Identification; Gene-Gene Interaction Network Inference) → Therapeutic Discovery Validation]

Diagram 2: LPM Modeling Pipeline - This diagram shows how batch-corrected data feeds into Large Perturbation Model training and enables multiple biological discovery tasks that ultimately validate therapeutic hypotheses.

Table 3: Research Reagent Solutions for Quality-Assured Perturbation Modeling

| Resource | Type | Function | Example Use Case |
| --- | --- | --- | --- |
| Quartet Reference Materials | Biological standards | Multi-level quality control for proteomics | Assessing batch effect correction efficacy across labs [59] |
| SuPreMo | Computational framework | In silico mutagenesis and sequence perturbation | Generating variant sequences for input to predictive models [61] |
| SingleR | Cell type annotation | Automated cell type identification | Ensuring consistent cell labeling across batches [60] |
| CellChat | Cell communication analysis | Inference of intercellular signaling networks | Studying how perturbations affect cell-cell communication [60] |
| InferCNV | Copy number variation analysis | Detection of CNVs from scRNA-seq data | Distinguishing malignant from non-malignant cells in tumor samples [60] |
| Harmony | Batch integration algorithm | Aligning datasets in reduced dimension space | Integrating single-cell data across multiple patients or conditions [60] |

Mitigating data quality issues is not a mere preprocessing step but a foundational requirement for robust in silico perturbation modeling. As single-cell Foundation Models and Large Perturbation Models continue to advance in sophistication and application scope, the integrity of their predictions will remain critically dependent on the quality of their training data. By implementing the systematic assessment protocols, correction strategies, and validation frameworks outlined in this document, researchers can significantly enhance the reliability of their computational models. This rigorous approach to data quality ensures that model predictions reflect genuine biology rather than technical artifacts, ultimately accelerating the discovery of novel therapeutic targets and biological mechanisms through more trustworthy in silico experimentation.

The application of single-cell foundation models (scFMs) has revolutionized our ability to interpret cellular heterogeneity and complex regulatory networks, positioning them as pivotal tools in computational biology and drug discovery [1]. These models, typically built on transformer architectures, are pretrained on vast datasets encompassing millions of single-cell transcriptomes to learn fundamental biological principles [1]. However, this capability comes at a cost: training and fine-tuning these large-scale deep learning models demand intensive computational resources, creating a significant bottleneck for their widespread adoption [1]. Effectively managing these resource demands is particularly crucial within the context of in silico perturbation (ISP) modeling, where researchers aim to create accurate "virtual cell" models that can simulate cellular responses to genetic and chemical perturbations without extensive wet-lab experimentation [9]. This application note provides a structured framework and practical protocols for optimizing computational efficiency when working with scFMs, enabling researchers to balance model performance with practical resource constraints.

Quantitative Landscape of scFM Computational Demands

Understanding the specific resource requirements of different scFMs is essential for project planning and infrastructure allocation. The computational intensity varies significantly across models based on their architecture, parameter count, and pretraining strategies.

Table 1: Computational Profiles of Prominent Single-Cell Foundation Models

| Model | Parameter Scale | Primary Architecture | Key Resource Intensifiers | Noted Efficiency Features |
| --- | --- | --- | --- | --- |
| scGPT | Not Specified | GPT-based Decoder | Flash-attention blocks, random gene identity embeddings [5] | Superior memory usage and computational time efficiency [5] |
| Geneformer | 30M-12L to 106M-12L | Transformer | Model depth, attention mechanisms | Efficient cell embedding generation [5] [9] |
| scBERT | Smaller Scale | BERT-like Encoder | Bidirectional attention, gene2vec embeddings [5] | Higher memory consumption relative to performance [5] |
| scFoundation | Not Specified | Transformer | Pretraining corpus size, embedding dimensions | Moderate computational efficiency [5] |

Table 2: Impact of Input Dimensions on Computational Load

| Factor | Effect on Memory | Effect on Training Time | Performance Correlation |
| --- | --- | --- | --- |
| Input Gene Sequence Length | Linear increase with longer sequences [5] | Significant increase with longer sequences | scGPT improves with longer inputs; scBERT declines [5] |
| Batch Size | Proportional increase | Decreases with larger batches (to a point) | Optimal batch size varies by model architecture |
| Dataset Integration Complexity | Higher with cross-technology batches [5] | Extended processing for batch correction | Model-dependent: scGPT handles consistency better than cross-technology [5] |

Benchmarking studies reveal that model performance does not always correlate with computational footprint. In comprehensive evaluations, scGPT consistently demonstrated superior computational efficiency in terms of both memory usage and processing time, while scBERT showed declining performance with increasing input sequence length despite significant resource consumption [5]. This highlights the importance of selecting models based not only on reported accuracy but also on their computational characteristics for specific tasks.

Strategic Approaches to Computational Efficiency

Parameter-Efficient Fine-Tuning (PEFT) with LoRA

The Low-Rank Adaptation (LoRA) technique has emerged as a transformative approach for optimizing computational workload during fine-tuning. LoRA operates on a mathematical insight that weight updates during adaptation have a low "intrinsic rank" and can be represented in a much lower-dimensional space [62].

Instead of updating all parameters in a weight matrix W (with dimensions d×k), LoRA freezes the pre-trained weights and injects trainable rank decomposition matrices. The modified forward pass is represented as h = W₀x + ΔWx = W₀x + BAx, where A ∈ R^{r×k} and B ∈ R^{d×r} are the trainable adaptation matrices, and the rank r ≪ min(d,k) [62].
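The parameter arithmetic behind LoRA can be checked directly. The sketch below uses plain Python lists (a real implementation would use a tensor library) to show the forward pass h = W₀x + B(Ax) and the small fraction of weights that remain trainable; the dimensions are illustrative.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

d = k = 512   # frozen weight W0 is d x k
r = 8         # LoRA rank, r << min(d, k)
full_params = d * k
lora_params = d * r + r * k  # B is d x r, A is r x k
print(f"trainable fraction: {lora_params / full_params:.1%}")  # prints 3.1%

random.seed(4)
W0 = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # B initialized to zero, so h starts at W0 x
x = [[random.gauss(0, 1)] for _ in range(k)]
# Forward pass h = W0 x + B(Ax); only A and B would receive gradient updates.
h = [[w + dw] for (w,), (dw,) in zip(matmul(W0, x), matmul(B, matmul(A, x)))]
```

Initializing B to zero is the standard trick that makes the adapted model start out identical to the pretrained one.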

Table 3: LoRA Configuration for scFM Fine-Tuning

| Component | Recommended Setting | Resource Impact | Performance Consideration |
| --- | --- | --- | --- |
| Rank (r) | 4-16 | Higher rank increases trainable parameters | Balance between adaptability and overfitting |
| Alpha | 2×rank | Scaling factor for adapted weights | Affects learning rate sensitivity |
| Target Modules | Attention layers (query, value) | Determines which components are adapted | Critical for maintaining pretrained knowledge |
| Dropout | 0.1 | Regularization during adaptation | Reduces overfitting to small datasets |

Practical implementation of LoRA can reduce trainable parameters by up to 98.4% compared to full fine-tuning, enabling adaptation of billion-parameter models on consumer-grade GPUs with minimal performance degradation [62]. For ISP tasks, this allows researchers to efficiently specialize models for predicting cellular responses to perturbations without prohibitive computational costs.

Multi-Fidelity Architecture Optimization

The GenomeNet-Architect framework demonstrates how multi-fidelity optimization can dramatically reduce computational overhead when tailoring architectures for genomic tasks. This approach uses cheaper approximations of model performance during initial search phases, allocating full resources only to the most promising candidates [63].

The key innovation is the progressive allocation of computational budget: initial configurations are evaluated with shorter training times and smaller data subsets, while only top-performing candidates receive full training cycles. This strategy can reduce overall search time by 67-83% while identifying architectures that outperform expert-designed models [63].

For scFM implementation, this translates to:

  • Rapid prototyping with reduced epoch counts and minimal data sampling
  • Progressive budget allocation to promising configurations
  • Early stopping for underperforming model variants
  • Transferable architecture patterns across related biological tasks
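The progressive budget allocation described above resembles successive halving. The sketch below is an illustrative toy version (not the GenomeNet-Architect implementation): every configuration gets a cheap, noisy evaluation, and only the top half is promoted to each larger training budget.

```python
import random

def successive_halving(configs, evaluate, budgets=(1, 3, 9), keep=0.5):
    """Multi-fidelity search sketch: evaluate all configs cheaply, then
    repeatedly promote the top fraction to a larger training budget."""
    survivors = list(configs)
    spent = 0
    for budget in budgets:
        ranked = sorted(survivors, key=lambda c: -evaluate(c, budget))
        spent += budget * len(survivors)
        survivors = ranked[:max(1, int(len(ranked) * keep))]
    return survivors[0], spent

# Toy objective: each config's true quality emerges as the budget grows.
random.seed(5)
true_quality = {c: random.random() for c in range(16)}
def evaluate(c, budget):
    return true_quality[c] + random.gauss(0, 0.3 / budget)  # cheap = noisy

best, spent = successive_halving(range(16), evaluate)
full_cost = 16 * sum((1, 3, 9))
print(best, f"budget used: {spent}/{full_cost}")  # 76/208, ~63% saved
```

Here 16 configurations are screened for 76 budget units instead of 208, in line with the reported 67-83% search-time reductions.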

Closed-Loop Fine-Tuning with Strategic Experimental Design

In the context of ISP modeling, the closed-loop framework demonstrates how strategic incorporation of experimental data can optimize computational efficiency. This approach integrates perturbation data during model fine-tuning, significantly improving prediction accuracy while managing resource demands [9].

Remarkably, research shows that just 10-20 well-chosen perturbation examples can produce substantial improvements in model accuracy, with performance metrics approaching saturation at approximately 20 examples [9]. This suggests that computationally intensive fine-tuning on massive perturbation datasets may be unnecessary for effective ISP modeling.

Experimental Protocols for Resource-Constrained Environments

Protocol: Efficient Fine-Tuning for In Silico Perturbation Prediction

Objective: Adapt pre-trained scFMs to predict cellular responses to genetic perturbations while minimizing computational resource requirements.

Materials and Computational Resources:

  • Pre-trained scFM (Geneformer-30M-12L recommended for balance of performance and efficiency [9])
  • Base computing instance: GPU with 16-24GB VRAM (e.g., NVIDIA RTX 3080/A4000)
  • Fine-tuning framework with LoRA implementation
  • Perturbation dataset with minimum of 20 representative examples [9]

Procedure:

  • Model Initialization:
    • Load pre-trained weights and freeze base parameters
    • Configure LoRA adapters with rank=8, alpha=16 targeting attention layers
    • Set dropout rate of 0.1 for regularization
  • Data Preparation:

    • Curate perturbation examples including both genetic and chemical perturbations
    • Balance dataset to include responses shifting cells toward both activated and resting states
    • Format input sequences using model-specific tokenization (gene ranking by expression levels)
  • Progressive Fine-Tuning:

    • Begin with reduced sequence length (1,000-2,000 genes) for initial epochs
    • Use batch size of 8-16 constrained by GPU memory
    • Apply gradient accumulation to maintain effective batch size
    • Implement early stopping with patience of 3-5 epochs
  • Validation and Iteration:

    • Evaluate on hold-out perturbation set after each epoch
    • Monitor positive predictive value (PPV) and negative predictive value (NPV)
    • Gradually increase sequence length if performance plateaus

Expected Outcomes: This protocol should achieve a three-fold improvement in PPV (from 3% to 9%) for perturbation prediction while maintaining NPV above 98%, comparable to published closed-loop ISP results [9]. Total training time typically ranges from 2-6 hours on a single GPU, depending on dataset size and model architecture.

Protocol: Computational Budgeting for Model Selection

Objective: Systematically select appropriate scFMs based on task requirements and computational constraints.

Materials:

  • Multiple scFM options (scGPT, Geneformer, scBERT, scFoundation)
  • Target dataset with defined biological task
  • Benchmarking framework (BioLLM recommended [5])

Procedure:

  • Task Characterization:
    • Classify task type (cell embedding, perturbation prediction, batch correction)
    • Assess dataset size and complexity (number of cells, genes, batches)
    • Define performance requirements and computational constraints
  • Efficiency Profiling:

    • Extract zero-shot embeddings from candidate models
    • Evaluate quality using Average Silhouette Width (ASW) for biological relevance
    • Measure computational time and memory usage
    • Assess batch-effect-removal capabilities using batch-specific ASW
  • Trade-off Analysis:

    • Plot performance vs. computational cost for each model
    • Identify Pareto-optimal candidates
    • Select model demonstrating best efficiency for specific task type
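The Pareto-optimal filtering in step 3 can be expressed in a few lines. The performance/cost numbers below are hypothetical placeholders for illustration, not benchmark results:

```python
def pareto_front(models):
    """models: {name: (performance, cost)}. Keep models not dominated by any
    other (no alternative at least as good AND at least as cheap)."""
    front = []
    for name, (perf, cost) in models.items():
        dominated = any(p >= perf and c <= cost and (p > perf or c < cost)
                        for n, (p, c) in models.items() if n != name)
        if not dominated:
            front.append(name)
    return sorted(front)

# Hypothetical (performance, GPU-hours) pairs for illustration only.
models = {"scGPT": (0.85, 4), "Geneformer": (0.78, 3),
          "scFoundation": (0.75, 6), "scBERT": (0.65, 8)}
print(pareto_front(models))  # ['Geneformer', 'scGPT']
```

Only the models on the front merit further consideration; everything else is beaten on both axes by some alternative.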

Decision Framework:

  • For cell embedding tasks with computational constraints: Prioritize scGPT for its consistent performance and efficiency [5]
  • For perturbation prediction with limited labeled data: Implement Geneformer with closed-loop fine-tuning [9]
  • For resource-intensive applications with large datasets: Employ scGPT with LoRA adaptation [5] [62]

The Scientist's Toolkit: Essential Research Reagents & Computational Solutions

Table 4: Key Computational Tools for Managing scFM Resource Demands

| Tool/Resource | Function | Implementation Benefit |
| --- | --- | --- |
| BioLLM Framework | Unified interface for diverse scFMs [5] | Standardized APIs eliminate architectural inconsistencies, reduce implementation overhead |
| LoRA (Low-Rank Adaptation) | Parameter-efficient fine-tuning [62] | Enables adaptation of billion-parameter models on consumer GPUs |
| GenomeNet-Architect | Neural architecture optimization [63] | Automates model selection, reduces design time by 67-83% |
| Closed-Loop ISP Framework | Iterative model refinement with experimental data [9] | Maximizes information gain from minimal perturbation examples (10-20 samples) |
| Multi-Fidelity Optimization | Progressive resource allocation [63] | Reduces computational waste during hyperparameter tuning |
| Flash-Attention Blocks | Memory-efficient attention computation [5] | Enables processing of longer gene sequences within memory constraints |

Visualizing Computational Workflows

Resource-Aware scFM Implementation Strategy

[Diagram: Define Research Objective → Task Analysis & Constraint Definition → Data Assessment (cell count, features, batches) and Computational Resource Profiling (VRAM, processing time, storage) → Model Selection Framework → Zero-Shot Embedding Evaluation → Efficiency Metric Calculation (ASW, memory, time) → Optimal Model Selection → Fine-Tuning Strategy: LoRA configuration (rank 4-16, targeting attention) when resource-constrained, full fine-tuning when resources are adequate, or closed-loop ISP (10-20 examples) for ISP tasks → Implementation & Validation → Progressive Training (starting with short sequences) → Performance Evaluation (PPV, NPV, biological relevance) → Model Deployment]

Managing the computational intensity of scFM training and fine-tuning requires a strategic approach that balances model performance with practical resource constraints. Through the implementation of parameter-efficient fine-tuning techniques like LoRA, strategic model selection informed by benchmarking studies, and resource-aware experimental design, researchers can effectively leverage scFMs for in silico perturbation modeling without prohibitive computational costs. The protocols and frameworks presented here provide a pathway for implementing these models in diverse research environments, from academic laboratories to industrial drug discovery programs. As the field evolves, continued development of computational efficiency methods will be essential for democratizing access to scFM technologies and realizing their full potential for biological discovery and therapeutic development.

The application of single-cell foundation models (scFMs) and other deep learning approaches in biology has created a profound interpretability gap. While these models demonstrate impressive predictive accuracy, their internal representations and decision-making processes often remain opaque black boxes [64] [65]. This opacity presents significant challenges for drug development and biological discovery, where understanding mechanism is as crucial as prediction. This document outlines the core challenges, provides protocols for evaluating latent embeddings, and presents visualization strategies to enhance interpretability within in silico perturbation modeling research.

The Core Interpretability Challenge in Biological AI

In biological modeling, the black box problem manifests uniquely and with high stakes. When models like scGPT or scFoundation predict cellular responses to perturbations, we often cannot identify why they make specific predictions or what biological mechanisms they have learned [64] [66]. This limitation has direct consequences:

  • Inaccessible Biological Insights: Models may learn novel biological relationships that remain hidden within billions of parameters [64]
  • Spurious Predictions: Inability to detect when models make predictions for non-biological or artifact-driven reasons [64]
  • Limited Therapeutic Guidance: Protein engineering and drug discovery lack principled guidance from model understanding [64]

Recent benchmarking studies reveal that even simple baseline models (e.g., taking the mean of training examples) can outperform complex foundation models in predicting post-perturbation gene expression [66]. Furthermore, basic machine learning models incorporating biologically meaningful features like Gene Ontology vectors significantly outperform scFMs, suggesting that the latent embeddings in these foundation models may not be capturing the most biologically relevant information [66].

Quantitative Benchmarking of Latent Embedding Quality

Table 1: Performance Comparison of Embedding Types in Perturbation Prediction

| Embedding Type | Model/Dataset | Pearson Delta (Adamson) | Pearson Delta (Norman) | Biological Interpretability |
| --- | --- | --- | --- | --- |
| GO Term Features | Random Forest | 0.739 | 0.586 | High (direct biological annotation) |
| scELMO Embeddings | Random Forest | 0.706 | 0.663 | Moderate (text-derived semantics) |
| scGPT Embeddings | Random Forest | 0.727 | 0.583 | Low (model-derived, opaque) |
| scGPT Embeddings | Fine-tuned scGPT | 0.641 | 0.554 | Very Low |
| scFoundation Embeddings | Fine-tuned scFoundation | 0.552 | 0.459 | Very Low |
| Train Mean Baseline | None | 0.711 | 0.557 | None |

Data adapted from critical benchmarking studies of post-perturbation RNA-seq prediction models [66]. Pearson Delta measures correlation in differential expression space, with higher values indicating better performance.

The benchmarking data reveals a crucial insight: using foundation model embeddings as features in simpler, interpretable models like Random Forests often yields better performance than the fine-tuned foundation models themselves [66]. This suggests that the information is present in the embeddings but may not be optimally utilized by the complex architectures.
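
The embedding-as-features setup behind Table 1 can be sketched in a few lines. The sketch below is illustrative, not any benchmark's actual pipeline: it uses synthetic arrays in place of real perturbation embeddings and expression profiles, and computes the Pearson delta metric (correlation of control-referenced expression changes) for a Random Forest trained on frozen embeddings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical data: one row per perturbation.
# X: frozen foundation-model embeddings of the perturbed gene (stand-in for scGPT/GO features).
# Y: mean post-perturbation expression profile for that perturbation.
n_perts, emb_dim, n_genes = 80, 64, 200
X = rng.normal(size=(n_perts, emb_dim))
Y = rng.normal(size=(n_perts, n_genes)) + 0.5 * X[:, :1]  # weak embedded signal
ctrl = rng.normal(size=n_genes)                           # unperturbed control profile

X_tr, X_te, Y_tr, Y_te = X[:60], X[60:], Y[:60], Y[60:]

# Random Forest maps embedding -> post-perturbation expression profile.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, Y_tr)
Y_hat = rf.predict(X_te)

# Pearson delta: correlation between predicted and true *changes* relative to control.
deltas = [pearsonr(yh - ctrl, yt - ctrl)[0] for yh, yt in zip(Y_hat, Y_te)]
print(f"mean Pearson delta: {np.mean(deltas):.3f}")
```

Substituting real scGPT, scELMO, or GO-term feature vectors for the synthetic `X` reproduces the comparison in Table 1 in spirit.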

Experimental Protocols for Evaluating Biological Meaning

Protocol 3.1: Feature Activation Analysis for Biological Discovery

Purpose: To validate whether latent features correspond to genuine biological mechanisms rather than dataset artifacts.

Materials:

  • Trained model with latent embeddings (e.g., scFM, GEDI, or NOBLE)
  • Reference biological databases (Swiss-Prot, InterPro, GO)
  • Statistical analysis environment (Python/R)

Procedure:

  • Identify High-Activation Features: Extract latent dimensions that show strong, consistent activation patterns across cell populations or conditions [64]
  • Cross-Reference with Known Biology: For each high-activation feature, identify genes/proteins that correlate with its activation pattern [64]
  • Database Validation: Check if correlated genes belong to known biological pathways or complexes using annotation databases [64]
  • Identify Annotation Gaps: Note cases where strong feature activation occurs on unannotated or poorly characterized genes—these may represent novel biological discoveries [64]
  • Experimental Triangulation: When possible, use orthogonal experimental data (e.g., protein structures, evolutionary conservation) to validate putative novel annotations [64]
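
Steps 1-3 of this protocol can be sketched numerically. The code below is a minimal illustration with synthetic data standing in for real latent activations, and a hypothetical GO gene set in place of a real annotation database.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs: per-cell latent activations and gene expression.
n_cells, n_dims, n_genes = 500, 32, 100
Z = rng.normal(size=(n_cells, n_dims))      # latent activations
Z[:, 7] *= 2.0                              # dim 7 carries a strong signal
E = rng.normal(size=(n_cells, n_genes))     # expression matrix
E[:, :5] += Z[:, [7]]                       # genes 0-4 track latent dim 7
go_pathway = {0, 1, 2, 3, 4, 60}            # hypothetical GO gene set

# Step 1: flag a high-activation feature (largest variance across cells).
dim = int(np.argmax(Z.var(axis=0)))

# Step 2: correlate that feature's activation with every gene.
z = (Z[:, dim] - Z[:, dim].mean()) / Z[:, dim].std()
Ec = (E - E.mean(0)) / E.std(0)
corr = Ec.T @ z / n_cells

# Step 3: check whether the top-correlated genes overlap a known pathway;
# strong hits outside the annotation are candidate annotation gaps (step 4).
top = set(np.argsort(-np.abs(corr))[:10].tolist())
overlap = top & go_pathway
print(f"dim {dim}: {len(overlap)}/10 top genes in pathway; "
      f"unannotated hits: {sorted(top - go_pathway)}")
```
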

Example Application: In the InterPLM study, feature f/939 consistently activated on proteins with a "Nudix box motif." When one strongly activating protein (B2GFH1) lacked this annotation in Swiss-Prot, researchers confirmed through InterPro and structural analysis that it indeed contained a Nudix box—revealing a missing database annotation rather than a model error [64].

Protocol 3.2: Cross-Modal Validation of Latent Representations

Purpose: To ensure latent embeddings capture biologically consistent patterns across different data modalities.

Materials:

  • Multi-modal biological data (e.g., gene expression + protein structure + evolutionary data)
  • Model with latent representations
  • Integration framework (e.g., GEDI for multi-sample single-cell data)

Procedure:

  • Train Multi-Modal Encoders: Develop separate encoders for each data modality that project into a shared latent space [67]
  • Assess Cross-Modal Consistency: Measure whether biologically similar samples cluster together regardless of input modality [67]
  • Perform Ablation Studies: Systematically remove specific biological priors from the model to determine which factors drive latent organization [68]
  • Validate with Experimental Variability: Use frameworks like NOBLE to test whether latent spaces capture natural biological variability observed in experimental data [68]
  • Pathway Alignment: Incorporate gene-set priors to align latent dimensions with known biological pathways and assess enrichment [67]
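
Step 2 (cross-modal consistency) can be checked with a simple matching test: if two encoders project into a consistent shared space, each sample's modality-A embedding should have its own modality-B embedding as nearest neighbor. A sketch under synthetic, paired data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)

# Hypothetical shared latent space: two encoders project paired samples
# (e.g., RNA and protein measurements of the same cells) into 16 dims.
n, d = 200, 16
true_state = rng.normal(size=(n, d))                  # underlying biology
mod_a = true_state + 0.1 * rng.normal(size=(n, d))    # modality-A embedding
mod_b = true_state + 0.1 * rng.normal(size=(n, d))    # modality-B embedding

# For each modality-A point, is its nearest modality-B neighbor the
# embedding of the same sample? High match rate = cross-modal consistency.
nn = NearestNeighbors(n_neighbors=1).fit(mod_b)
_, idx = nn.kneighbors(mod_a)
match_rate = float((idx.ravel() == np.arange(n)).mean())
print(f"cross-modal matching rate: {match_rate:.2f}")
```
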

Example Application: The GEDI framework enables sample-specific transformations of a reference latent manifold, allowing researchers to disentangle technical variability from genuine biological signals and directly associate latent dimensions with sample characteristics like disease severity [67].

Visualization Strategies for Interpretable Latent Spaces

Effective visualization is crucial for interpreting complex biological latent spaces. The following workflow provides a systematic approach to creating interpretable visualizations of scFM embeddings:

[Workflow diagram: raw latent embeddings → Rule 1: identify data nature (nominal, ordinal, interval, ratio) → Rule 2: select a perceptually uniform color space (CIE Luv/Lab) → Rule 3: create a color palette in that space → Rule 4: apply the palette to the data → Rule 5: check color context, adjusting the palette if needed → Rule 6: assess color-vision deficiencies and accessibility → interpretable latent space visualization.]

Color Application Workflow for Biological Data


When applying color to latent space visualizations, follow these evidence-based rules derived from colorization research [69]:

  • Identify Data Nature: Categorical data (cell types) need distinct hues; continuous data (expression gradients) need perceptually ordered color ramps [69]
  • Select Perceptually Uniform Color Spaces: CIE Luv and CIE Lab spaces ensure equal perceptual distance corresponds to equal numerical distance [69]
  • Check Color Context: Colors interact with surrounding colors—test in actual visualization context [69]
  • Assess Accessibility: Approximately 8% of males have color vision deficiency—test with color deficiency simulators [69]

The Scientist's Toolkit: Essential Research Reagents

Table 2: Key Computational Tools for Latent Space Interpretation

| Tool/Resource | Type | Primary Function | Interpretability Features |
| --- | --- | --- | --- |
| Sparse Autoencoders (SAEs) | Interpretation Method | Extract interpretable features from model internals | Identifies monosemantic features corresponding to biological concepts [64] |
| GEDI Framework | Analysis Framework | Multi-sample single-cell analysis | Enables cluster-free differential expression along cell state continuum [67] |
| NOBLE | Neural Operator | Captures experimental variability in neuron models | Biologically-informed latent embeddings for neural dynamics [68] |
| PertEval-scFM | Benchmarking Framework | Standardized evaluation of scFMs | Assesses zero-shot embedding quality for perturbation prediction [70] |
| scPerturb | Data Resource | Harmonized single-cell perturbation data | Provides ground truth for evaluating predictive models [71] |
| CellOracle | GRN-Based Prediction | Infers gene regulatory networks | Mechanistically interpretable perturbation predictions [71] |

Case Study: Successful Biological Discovery Through Interpretable Features

The analysis of feature f/19746 in the Evo 2 DNA foundation model demonstrates how interpretable latent features can lead to genuine biological discovery [64]. This feature consistently activated across prophage regions in bacterial genomes, including previously unannotated regions. When researchers investigated, they found these regions contained phage-associated genes like integrases and invertases. Crucially, the feature activation pattern revealed the model had learned the functional relationship between CRISPR systems and phage immunity rather than superficial sequence similarity—when researchers scrambled CRISPR spacer sequences, activation persisted, but scrambling the direct repeats eliminated activation [64].

This case exemplifies the potential of interpretability methods to function as discovery tools that can identify missing biological annotations and reveal deeper functional relationships learned by the models.

Addressing interpretability challenges in biological latent embeddings requires both technical advances and cultural shifts in how we evaluate computational models. Promising directions include:

  • Biologically-Informed Architectures: Frameworks like NOBLE that incorporate biological priors directly into model structure [68]
  • Unified Analysis Frameworks: Approaches like GEDI that integrate multiple analysis steps into coherent interpretable frameworks [67]
  • Benchmarking Standards: Initiatives like the Virtual Cell Challenge that establish rigorous evaluation standards [71]
  • Cross-Modal Validation: Leveraging multiple data types to triangulate biological meaning in latent spaces [67]

As the field progresses, the goal should not be merely to predict cellular behaviors but to understand them. The protocols and frameworks outlined here provide a pathway toward models that are not just predictive but truly explanatory, accelerating drug development and biological discovery through interpretable in silico perturbation modeling.

Mode collapse occurs when a machine learning model fails to capture the full diversity of the underlying data distribution, producing limited or repetitive predictions. Within the specialized field of in silico perturbation modeling with single-cell foundation models (scFMs), this manifests as an inability to accurately predict the unique cellular responses—specifically, changes in gene expression—elicited by diverse genetic or chemical perturbations [72] [6]. Instead, a collapsed model may default to predicting an average response, thereby obscuring the specific biological signals crucial for therapeutic discovery. Recent benchmarks have revealed a troubling anomaly: sophisticated perturbation-response models are frequently outperformed by a simplistic baseline that predicts the average of all perturbed cells in the training set, disregarding the individual perturbation label [72]. This indicates a systemic issue with how model performance is evaluated and underscores the critical need for robust solutions to ensure predictive diversity.

Identifying Mode Collapse in scRNA-seq Perturbation Modeling

Quantitative Signatures and Diagnostic Metrics

The primary quantitative signature of mode collapse is anomalously high performance on traditional metrics like unweighted Mean Squared Error (MSE) or control-referenced metrics such as Pearson(Δ), coupled with a failure to recapitulate the effects on specific, differentially expressed genes (DEGs) [72]. A definitive diagnostic check involves comparing your model's performance against a mean baseline—a model that always predicts the average expression profile across the entire training dataset. If a complex model fails to significantly outperform this naive baseline on DEG-aware metrics, it is likely suffering from mode collapse. The core issue is that standard metrics can be gamed by accurately predicting the large, uninteresting regions of the gene expression space that remain unchanged by a perturbation, while missing the critical, albeit smaller, niche signals [72].

Table 1: Key Diagnostic Metrics for Identifying Mode Collapse

| Metric Name | Traditional Use & Pitfall | Proposed Robust Alternative | Interpretation in Diagnostics |
| --- | --- | --- | --- |
| Mean Squared Error (MSE) | Measures average L2 error; rewards accuracy on non-changing genes, favoring mean prediction [72]. | Weighted MSE (WMSE) [72] | A collapsed model achieves low (good) traditional MSE yet poor WMSE. |
| Pearson(Δ) | Correlates control-referenced deltas; inflated by systematic control bias [72]. | Weighted Delta R² (R²w(Δ)) [72] | Collapsed models show high Pearson(Δ) but low R²w(Δ), indicating failure to predict true effect sizes. |
| Mean Baseline Performance | A naive predictor that outputs the dataset mean; used as a negative control [72]. | Comparison against this baseline using WMSE/R²w(Δ). | A specialist model should significantly and consistently outperform the mean baseline. |

Experimental Protocol for Diagnosing Mode Collapse

Objective: To determine whether a given scFM for perturbation prediction is experiencing mode collapse.

Inputs: Trained perturbation-response model; held-out test set of single-cell perturbation data.

Procedure:

  • Compute Baselines:
    • Mean Baseline: For each gene, calculate its average expression across all perturbed cells in the training set. This vector is the "mean baseline" prediction for any perturbation.
    • Positive Control (Optional): Establish an upper performance bound using a technical duplicate baseline, which estimates optimal performance given the intrinsic noise of the dataset [72].
  • Generate Predictions: Run the trained model and the mean baseline on the held-out test set.
  • Calculate Metrics:
    • Compute traditional metrics (e.g., MSE, Pearson(Δ)) for both your model and the mean baseline.
    • Compute robust, DEG-aware metrics (e.g., WMSE, R²w(Δ)) for both models.
  • Analyze and Interpret:
    • Plot the performance of your model against the mean and positive baselines across all metrics.
    • Diagnosis: If your model's performance on robust metrics (WMSE, R²w(Δ)) is not significantly better than the mean baseline, it is likely experiencing mode collapse.
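
The core of this procedure can be sketched as follows. The data, gene weights, and the "collapsed" model are synthetic stand-ins, constructed so that the diagnostic signature (model WMSE indistinguishable from baseline WMSE) is visible.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic ground truth: most genes barely change, 20 DEGs respond strongly.
n_perts, n_genes = 50, 300
deg = np.arange(20)
truth = rng.normal(0, 0.05, size=(n_perts, n_genes))
truth[:, deg] += rng.normal(0, 1.0, size=(n_perts, len(deg)))

# Step 1: mean baseline predicts the training average for every perturbation.
mean_baseline = np.tile(truth.mean(axis=0), (n_perts, 1))
# A "collapsed" model: effectively the mean baseline plus tiny noise.
collapsed = mean_baseline + rng.normal(0, 0.01, size=truth.shape)

# Step 3: traditional vs DEG-aware metrics (weight 10 on DEGs, illustrative).
w = np.ones(n_genes); w[deg] = 10.0
def mse(pred, y):  return float(((pred - y) ** 2).mean())
def wmse(pred, y): return float((w * (pred - y) ** 2).mean() / w.mean())

# Step 4: compare model against the baseline on both metric families.
for name, pred in [("mean baseline", mean_baseline), ("collapsed model", collapsed)]:
    print(f"{name}: MSE={mse(pred, truth):.3f}  WMSE={wmse(pred, truth):.3f}")
# Diagnosis: model WMSE ~= baseline WMSE -> mode collapse.
```
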

This diagnostic workflow is visualized in the following diagram, which outlines the key decision points and analyses required to confirm the presence of mode collapse.

[Diagnostic flowchart: compute the mean baseline (perturbation average from the training set), generate predictions for both the model and the baseline on the test set, then calculate traditional metrics (MSE, Pearson(Δ)) and robust, DEG-aware metrics (WMSE, R²w(Δ)). High performance on traditional metrics combined with low performance on robust metrics confirms mode collapse; otherwise, mode collapse is not the primary issue.]

Figure 1: A diagnostic workflow for identifying mode collapse in perturbation models by comparing model performance against a mean baseline using both traditional and robust metrics.

Overcoming Mode Collapse: Experimental Protocols

Protocol A: Implementing DEG-Aware Weighted Loss Functions

Principle: Directly counter mode collapse by modifying the training objective to prioritize accurate prediction of genes that are most likely to change in response to perturbations [72].

Solution: Replace the standard MSE loss with a Weighted Mean Squared Error (WMSE) loss function. WMSE assigns a higher weight to genes that are known to be differentially expressed across the spectrum of perturbations in the training data, forcing the model to focus its capacity on these informative features.

Procedure:

  • Precompute Gene Weights:
    • For each gene, perform a statistical test (e.g., ANOVA) across all perturbation states (including control) in the training data to assess its overall variability.
    • Convert the resulting p-values or test statistics into a weight for each gene. A common method is to assign a weight of 1 to non-significant genes and a weight >1 (e.g., 5 or 10) to genes deemed differentially expressed after multiple-testing correction.
  • Integrate into Loss Function:
    • The WMSE for a single prediction is calculated as: WMSE = (1/N) * Σ(weight_gene_i * (true_expression_gene_i - predicted_expression_gene_i)²).
    • Implement this weighted sum in your model's training loop.
  • Train and Validate:
    • Train the model using the WMSE loss.
    • Validate the model using the robust metrics described in Section 2.1 to ensure improved performance on the DEG signals.
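
A minimal sketch of this protocol, using SciPy's one-way ANOVA for step 1 and a NumPy implementation of the WMSE formula for step 2. The dataset and the DEG weight of 10 are illustrative choices, not values from the cited study.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(4)

# Hypothetical training data: 3 perturbation states x 30 cells x 50 genes.
n_genes = 50
groups = [rng.normal(0, 1, size=(30, n_genes)) for _ in range(3)]
groups[1][:, :10] += 2.0   # genes 0-9 respond to perturbation state 1

# Step 1: per-gene ANOVA across perturbation states -> p-values,
# then Bonferroni-corrected significance -> weight 10 for DEGs, 1 otherwise.
pvals = np.array([f_oneway(*[g[:, j] for g in groups]).pvalue
                  for j in range(n_genes)])
weights = np.where(pvals < 0.05 / n_genes, 10.0, 1.0)

# Step 2: weighted MSE loss, per the formula above.
def wmse(y_true, y_pred):
    return float(np.mean(weights * (y_true - y_pred) ** 2))

y_true = rng.normal(size=n_genes)
print("genes receiving DEG weight:", int((weights > 1).sum()))
print("WMSE of a perfect prediction:", wmse(y_true, y_true))
```

The same weighted sum drops into a PyTorch or JAX training loop by replacing the NumPy ops with the framework's equivalents.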

Protocol B: Closed-Loop Fine-Tuning with Experimental Data

Principle: Leverage limited, high-quality experimental perturbation data to guide and correct the model's predictions, pulling it out of a collapsed state [9].

Solution: Implement a closed-loop fine-tuning framework where the scFM is iteratively updated with data from targeted Perturb-seq experiments.

Procedure:

  • Initial Model: Start with a pre-trained or base fine-tuned scFM (e.g., Geneformer).
  • Acquire Perturbation Data: Conduct or obtain a scRNA-seq dataset (e.g., Perturb-seq) where specific genetic or chemical perturbations have been applied. The number of unique perturbations can be relatively small (e.g., 10-20) to see substantial benefits [9].
  • Fine-Tuning: Further fine-tune the model on a combined dataset that includes:
    • The original single-cell data used for initial training/fine-tuning.
    • The new perturbation data. Critically, this fine-tuning can use a simplified objective, such as classifying cellular state (e.g., activated vs. resting) from the transcriptome, which incorporates the perturbation effects without requiring explicit gene-level regression [9].
  • Iterate: Use the improved "closed-loop" model to make new in silico predictions, which can then be used to prioritize further experiments, creating a virtuous cycle of prediction and validation.

The following diagram illustrates the iterative and cumulative nature of this powerful approach.

[Workflow diagram: pre-trained scFM → in silico perturbation screen → prioritize and conduct wet-lab experiments (e.g., Perturb-seq) → acquire scRNA-seq data → closed-loop fine-tuning with the new experimental data → improved, diversified model → iterate back to the in silico screen.]

Figure 2: The closed-loop fine-tuning protocol for overcoming mode collapse by iteratively incorporating experimental data.

Protocol C: Loss-Guided Exploration for Generative Models

Principle: Actively explore the state space to discover high-reward, unseen modes that the primary model currently misses. This is particularly relevant for generative models like GFlowNets used in biological sequence or perturbation design [73].

Solution: Employ a Loss-Guided GFlowNet (LGGFN) architecture, where an auxiliary agent's exploration is directed toward regions where the main model exhibits high training loss.

Procedure:

  • Model Setup: Implement a primary GFlowNet and an auxiliary GFlowNet.
  • Training Loop:
    • During training, the primary model learns to sample objects (e.g., gene sequences, perturbation combinations) proportional to a given reward.
    • The auxiliary model is trained not on the reward function directly, but to prioritize sampling trajectories (sequences of actions) that lead to objects for which the primary model has a high loss.
  • Outcome: This forces the exploration of under-explored, high-reward regions of the state space, significantly accelerating the discovery of diverse, valid modes and reducing mode collapse [73].

Table 2: Key resources for developing robust in silico perturbation models.

| Category | Item / Software | Function in Perturbation Modeling |
| --- | --- | --- |
| Benchmark Datasets | Norman et al. (2019) [72], Replogle et al. (2022) [72] | Standardized public datasets for training and benchmarking genetic perturbation models. |
| Computational Models | scGPT [72] [4], Geneformer [4] [9], GEARS [72] [4], CPA [4], LPM [4] | Foundational and specialized models for single-cell analysis and perturbation prediction. |
| Evaluation Metrics | Weighted MSE (WMSE) [72], Weighted Delta R² (R²w(Δ)) [72] | Robust, DEG-aware metrics to properly evaluate model performance and diagnose collapse. |
| Experimental Data | Perturb-seq [9] / CRISPR screens | High-quality ground-truth data for closed-loop fine-tuning and model validation. |

In silico perturbation modeling with single-cell foundation models (scFMs) represents a paradigm shift in computational biology, enabling the prediction of cellular responses to genetic or chemical interventions. These models, pretrained on vast single-cell transcriptomics corpora, learn fundamental biological principles that can be adapted to specialized tasks through transfer learning. The optimization of scFMs hinges on three interconnected pillars: strategic data curation to ensure biological comprehensiveness and technical quality, thoughtful model architecture selection to capture complex gene-gene interactions, and effective transfer learning protocols to bridge general pretraining with specific applications. This framework provides the foundational methodology for realizing the potential of "virtual cell" models in accelerating therapeutic discovery and mechanistic biology.

Data Curation Strategies

Data Sourcing and Compilation

The development of robust scFMs requires training on extensive, diverse, and high-quality single-cell datasets that capture a wide spectrum of biological conditions. Strategic data curation begins with leveraging large-scale repositories that provide standardized access to millions of single-cell profiles.

Table 1: Primary Data Sources for scFM Pretraining

| Data Source | Scale | Key Features | Applications |
| --- | --- | --- | --- |
| CZ CELLxGENE [1] | >100 million cells [1] | Unified access to annotated single-cell datasets [1] | General pretraining, cross-tissue analysis |
| Human Cell Atlas [1] | Multiorgan coverage [1] | Broad coverage of cell types and states [1] | Reference cell type embedding |
| PanglaoDB [1] | Curated compendium [1] | Data from multiple sources and studies [1] | Specialized model development |
| NCBI GEO/SRA [1] | Thousands of studies [1] | Extensive repository of sequencing data [1] | Supplemental training data |

The assembly of a high-quality, nonredundant dataset is as critical as model architecture for building robust scFMs [1]. This process requires careful selection of datasets, filtering of cells and genes, balancing dataset compositions, and implementing rigorous quality controls to address challenges such as varying sequencing depth, batch effects, technical noise, and inconsistent processing steps across studies [1].

Data Preprocessing and Tokenization

Tokenization converts raw gene expression data into structured inputs that scFMs can process, making it a critical optimization step. Unlike natural language, which has an inherent word order, gene expression data has no natural sequence, so transformer inputs require an imposed ordering strategy.

Gene Ranking Methods: A predominant approach orders genes by expression magnitude within each cell, creating a deterministic sequence where the top-ranked genes form the input "sentence" [1]. Alternative strategies include binning genes by expression values or using normalized counts without complex ranking [1]. Comparative analyses suggest no clear advantage for overly complex ranking systems, with some models reporting robustness using simple normalized counts [1].
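
Expression-based ranking amounts to a sort-filter-truncate over a cell's count vector. The toy sketch below uses a hypothetical six-gene vocabulary and is not any specific model's tokenizer:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical cell: counts over a small gene vocabulary.
gene_vocab = np.array(["GATA1", "TAL1", "RUNX1", "SPI1", "MYC", "CD34"])
counts = rng.poisson(lam=[5, 0, 12, 3, 8, 1])

# Expression-based ranking: order genes by expression (descending),
# drop zero-count genes, and truncate to a fixed context length.
# The surviving gene symbols form the input "sentence".
order = np.argsort(-counts, kind="stable")
order = order[counts[order] > 0][:4]       # context length 4
tokens = gene_vocab[order].tolist()
print(tokens)
```

Real tokenizers work the same way at scale (thousands of genes, context lengths of 2,048 or more), typically after depth normalization.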

Token Enrichment: Beyond basic gene tokens, optimized inputs incorporate special tokens representing cell identity metadata, experimental conditions, or omics modalities [18]. Gene-level metadata such as Gene Ontology terms or chromosomal locations can provide additional biological context [1]. Batch information may be incorporated as special tokens to mitigate technical variations, though some models demonstrate robustness to batch effects without explicit batch tokens [1].

Table 2: Tokenization Strategies in scFMs

| Strategy | Implementation | Advantages | Limitations |
| --- | --- | --- | --- |
| Expression-based ranking | Orders genes by expression level within each cell [1] | Deterministic, captures highly expressed features | May overlook low-expression regulatory genes |
| Value binning | Partitions genes into expression bins [1] | Reduces sparsity, groups genes by expression range | Loss of precise expression values |
| Normalized counts | Uses normalized expression without reordering [1] | Simplicity, preserves original expression relationships | May not optimize attention mechanisms |
| Metadata enrichment | Incorporates gene/cell metadata as special tokens [1] [18] | Provides biological context, improves interpretability | Increases model complexity |

Model Architecture Choices

Transformer Architectures for Single-Cell Data

scFMs predominantly leverage transformer architectures, which utilize attention mechanisms to weight relationships between gene tokens, enabling the model to identify which genes are most informative for specific cellular identities or states [1]. The adaptation of these architectures to single-cell data requires specialized considerations to address the unique characteristics of transcriptomic information.

Encoder Architectures: Models like scBERT employ bidirectional encoder architectures based on BERT, processing all genes in a cell simultaneously to learn comprehensive contextual relationships [1] [5]. This approach excels in classification tasks such as cell type annotation and embedding generation, where full context understanding is beneficial [1].

Decoder Architectures: Models such as scGPT utilize decoder-inspired architectures with unidirectional masked self-attention, iteratively predicting masked genes conditioned on known genes [1] [18]. This design demonstrates strengths in generative tasks and perturbation prediction, where sequential generation aligns with the autoregressive approach [1].

Hybrid Designs: Emerging architectures explore encoder-decoder combinations and custom modifications to leverage benefits of both approaches [1]. While no single architecture has emerged as universally superior, each demonstrates particular strengths depending on the target application [1].

[Architecture diagram: a single-cell expression matrix is tokenized and embedded, then passed through transformer layers whose attention mechanism captures gene interactions and biological patterns, yielding latent embeddings.]

Performance Comparison Across Architectures

Rigorous benchmarking reveals distinct performance profiles across scFM architectures, enabling informed model selection based on specific application requirements. The BioLLM framework provides standardized evaluation of multiple models across diverse tasks [5].

Table 3: Architecture Performance Across Tasks (Based on BioLLM Benchmarking [5])

| Model | Architecture Type | Cell Embedding Quality | Batch Effect Correction | Perturbation Prediction | Computational Efficiency |
| --- | --- | --- | --- | --- | --- |
| scGPT | Decoder-based [1] | Superior (ASW: 0.75-0.92) [5] | Excellent [5] | Strong [18] | High [5] |
| Geneformer | Encoder-based [9] | Moderate (ASW: 0.65-0.85) [5] | Moderate [5] | Strong with fine-tuning [9] | High [5] |
| scBERT | Encoder-based [1] [5] | Lower (ASW: 0.45-0.70) [5] | Poor [5] | Limited [5] | Lower [5] |
| scFoundation | Not specified | Moderate [5] | Moderate [5] | Gene-level strength [5] | Moderate [5] |

scGPT consistently demonstrates superior performance in generating biologically relevant cell embeddings, achieving average silhouette width (ASW) scores of 0.75-0.92 across diverse datasets, indicating excellent separation of cell types in latent space [5]. This model also excels in batch effect correction, effectively integrating cells of the same type across experimental conditions [5]. Geneformer shows particular strength in gene-level tasks and perturbation response prediction when fine-tuned, benefiting from its effective pretraining strategy [9] [5]. In contrast, scBERT generally underperforms, likely due to smaller model size and limited training data [5].

Transfer Learning and Fine-Tuning

Protocols for Model Adaptation

Transfer learning bridges general scFM pretraining with specialized applications through two primary approaches: zero-shot inference using pretrained embeddings without additional training, and task-specific fine-tuning that updates model weights on targeted datasets [5].

Zero-Shot Inference Protocol:

  • Embedding Extraction: Generate cell or gene embeddings from pretrained scFM without weight updates [5]
  • Feature Application: Utilize embeddings as fixed features for downstream tasks (e.g., clustering, classification) [5]
  • Performance Validation: Assess embedding quality using metrics like ASW for biological relevance and batch effect removal [5]
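
The three steps can be sketched with scikit-learn, using synthetic embeddings in place of real scFM output; ASW here is the standard silhouette score computed against known cell-type labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(6)

# Step 1 stand-in: zero-shot cell embeddings for three cell types.
centers = rng.normal(scale=5.0, size=(3, 32))
emb = np.vstack([c + rng.normal(size=(100, 32)) for c in centers])
cell_type = np.repeat([0, 1, 2], 100)

# Step 2: use the frozen embeddings as fixed features for a downstream task.
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)

# Step 3: validate embedding quality with average silhouette width (ASW).
asw = silhouette_score(emb, cell_type)
print(f"ASW against known cell types: {asw:.2f}")
```

With real models, step 1 would call the scFM's own embedding extraction (e.g., a forward pass with weights frozen); everything downstream is unchanged.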

Fine-Tuning Protocol:

  • Task Formulation: Define specific learning objective (e.g., cell state classification, perturbation response) [9]
  • Data Preparation: Curate task-specific dataset with appropriate labels (e.g., activated vs. resting T-cells) [9]
  • Model Adaptation: Update pretrained weights using limited task-specific data with reduced learning rates [9]
  • Validation: Evaluate on held-out test sets to assess generalization [9]

Closed-Loop Framework for Perturbation Modeling

The "closed-loop" framework represents an advanced transfer learning strategy that iteratively incorporates experimental perturbation data to enhance predictive accuracy [9] [74]. This approach significantly improves upon standard "open-loop" in silico perturbation (ISP) prediction by creating a feedback cycle between computational prediction and experimental validation [9].

[Workflow diagram: in open-loop ISP, a pretrained model directly produces ISP predictions. The closed-loop framework instead fine-tunes the pretrained scFM with perturbation data, performs in silico perturbation, validates predictions experimentally, and feeds model refinement back into fine-tuning.]

Application Protocol - RUNX1-Familial Platelet Disorder:

  • Model Fine-Tuning: Adapt Geneformer to distinguish RUNX1-knockout hematopoietic stem cells (HSCs) from controls using scRNA-seq data [9]
  • In Silico Screening: Perform ISP across 13,161 genes to identify perturbations shifting RUNX1-knockout cells toward control state [9]
  • Target Prioritization: Select candidate genes with available small molecule inhibitors for experimental testing [9]
  • Therapeutic Validation: Confirm model-predicted targets (mTOR and CD74-MIF signaling axis) and novel pathways (protein kinase C and phosphoinositide 3-kinase) [9]

This closed-loop approach demonstrated substantial improvement over open-loop ISP, increasing positive predictive value from 3% to 9% in T-cell activation studies while maintaining high negative predictive value (99%), sensitivity (76%), and specificity (81%) [9]. Performance gains saturated at approximately 20 perturbation examples, indicating that even modest experimental validation can substantially enhance prediction accuracy [9].
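
The reported screen statistics (PPV, NPV, sensitivity, specificity) come from a standard confusion matrix over predicted versus experimentally confirmed hits. The helper below illustrates the computation on invented toy numbers, not the study's data.

```python
import numpy as np

def screen_metrics(predicted_hit, actual_hit):
    """Confusion-matrix metrics used to evaluate in silico perturbation screens."""
    p = np.asarray(predicted_hit, bool)
    a = np.asarray(actual_hit, bool)
    tp, fp = (p & a).sum(), (p & ~a).sum()
    fn, tn = (~p & a).sum(), (~p & ~a).sum()
    return {
        "PPV": tp / (tp + fp),            # fraction of predicted hits that are real
        "NPV": tn / (tn + fn),            # fraction of predicted non-hits that are real
        "sensitivity": tp / (tp + fn),    # fraction of real hits recovered
        "specificity": tn / (tn + fp),    # fraction of non-hits correctly excluded
    }

# Toy screen: 100 genes, 10 true hits, model flags 12 candidates.
pred = np.zeros(100, bool); pred[:12] = True
act  = np.zeros(100, bool); act[:8] = True; act[60:62] = True
m = screen_metrics(pred, act)
print({k: round(float(v), 2) for k, v in m.items()})
```
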

Experimental Protocols

Benchmarking Framework for scFM Evaluation

Standardized benchmarking is essential for evaluating scFM performance across diverse biological tasks. The following protocol outlines a comprehensive evaluation framework based on established benchmarking methodologies [75] [5] [16].

Protocol 1: Comprehensive scFM Evaluation

  • Task Selection:
    • Gene-level tasks: Tissue specificity prediction, Gene Ontology term prediction [75]
    • Cell-level tasks: Batch integration, cell type annotation, cancer cell identification, drug sensitivity prediction [75]
  • Dataset Curation:
    • Utilize diverse datasets with varying biological conditions and technical artifacts [75]
    • Include cross-tissue, cross-species, and cross-technology variations to assess generalization [75]
    • Implement rigorous quality control and normalization across datasets [5]
  • Evaluation Metrics:
    • Traditional metrics: Average silhouette width (ASW), classification accuracy [5]
    • Biological metrics: scGraph-OntoRWR (cell type relationship consistency), Lowest Common Ancestor Distance (LCAD) for annotation errors [75]
    • Performance metrics: Positive predictive value, sensitivity, specificity for perturbation prediction [9]
  • Implementation:
    • Utilize standardized frameworks like BioLLM for consistent model comparison [5]
    • Assess both zero-shot and fine-tuned performance across tasks [5]
    • Evaluate computational efficiency including memory usage and processing time [5]

Perturbation Effect Prediction Protocol

Specialized protocols for perturbation prediction enable rigorous assessment of scFM capability to simulate cellular responses to genetic and chemical interventions.

Protocol 2: PertEval-scFM Framework [16]

  • Data Preparation:
    • Curate paired perturbation datasets (perturbed vs. unperturbed cells)
    • Include diverse perturbation types (CRISPRa, CRISPRi, small molecules) and strengths
    • Ensure balanced representation of different cellular states
  • Model Configuration:

    • Extract zero-shot embeddings from multiple scFMs (scGPT, Geneformer, etc.)
    • Compare against baseline methods (differential expression, simple linear models)
    • Test under distribution shift conditions
  • Evaluation Methodology:

    • Assess perturbation effect prediction accuracy using ground truth experimental data
    • Measure performance across different perturbation strengths and types
    • Evaluate generalization to unseen cell states and conditions
  • Analysis:

    • Compare scFM performance against simple baseline models
    • Identify failure modes under distribution shift
    • Provide recommendations for model selection based on perturbation type
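To make the scFM-versus-baseline comparison in Protocol 2 concrete, here is a hedged sketch of the core analysis step: fit a simple regression head on frozen (here synthetic) scFM embeddings to predict expression deltas, and compare held-out error against a mean-delta baseline. All array names, dimensions, and data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n_perts, emb_dim, n_genes = 40, 16, 100

# Hypothetical inputs: one zero-shot scFM embedding and one observed
# expression delta (perturbed minus control) per perturbation.
pert_emb = rng.normal(size=(n_perts, emb_dim))
true_delta = pert_emb @ rng.normal(size=(emb_dim, n_genes)) * 0.1

train, test = np.arange(30), np.arange(30, 40)

# scFM-based predictor: a simple ridge head on the frozen embeddings.
head = Ridge(alpha=1.0).fit(pert_emb[train], true_delta[train])
scfm_mse = float(np.mean((head.predict(pert_emb[test]) - true_delta[test]) ** 2))

# Baseline: predict the mean training delta for every unseen perturbation.
mean_pred = true_delta[train].mean(axis=0)
baseline_mse = float(np.mean((mean_pred - true_delta[test]) ** 2))
```

On this synthetic linear data the embedding head wins easily; the benchmarks' point is that on real perturbation data this advantage frequently disappears.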

The Scientist's Toolkit

Successful implementation of scFMs for in silico perturbation modeling requires access to curated data resources, computational frameworks, and evaluation tools.

Table 4: Essential Research Resources for scFM Implementation

Resource Category Specific Tools Function Access
Data Repositories CZ CELLxGENE [1], Human Cell Atlas [1], GEO/SRA [1] Provide standardized single-cell datasets for pretraining and fine-tuning Public access
Computational Frameworks BioLLM [5], scGPT [18], Geneformer [9] Unified interfaces for model training, fine-tuning, and evaluation Open source
Benchmarking Platforms PertEval-scFM [16], Custom evaluation pipelines [75] Standardized assessment of model performance across tasks Open source
Specialized Models scBERT [1], scFoundation [75], scPlantFormer [18] Task-optimized architectures for specific applications Open source

Implementation Considerations

Data Quality Requirements: Effective scFM implementation necessitates careful data curation addressing sparsity, batch effects, and technical noise through rigorous quality control and normalization [1] [75]. The non-sequential nature of gene expression data requires strategic tokenization approaches, with expression-based ranking or binning providing effective input structuring [1].

Computational Resources: Model selection should balance performance requirements with available resources, as scGPT and Geneformer offer favorable efficiency profiles for large-scale analyses [5]. The transfer learning strategy should align with data availability: zero-shot inference is suitable for exploratory analysis, while fine-tuning is essential for optimal performance on specific tasks [9] [5].

Validation Strategies: Biological relevance should be assessed through ontology-informed metrics that evaluate consistency with prior knowledge [75]. Perturbation predictions require rigorous experimental validation using orthogonal modalities to establish ground truth for model refinement [9].

Benchmarking scFM Performance: Rigorous Validation Against Simple Baselines

The advent of single-cell genomics has revolutionized our understanding of cellular heterogeneity, providing unprecedented resolution into the molecular states of individual cells. Concurrently, the rise of artificial intelligence has introduced single-cell foundation models (scFMs)—large-scale deep learning models pretrained on vast single-cell datasets—which promise to learn fundamental biological principles and generalize across diverse downstream tasks [12]. A particularly ambitious application of scFMs is in silico perturbation modeling, which aims to predict transcriptional responses to genetic perturbations without conducting costly wet-lab experiments [16]. This capability holds tremendous potential for accelerating therapeutic discovery and understanding disease mechanisms.

However, the rapid development of these models has created an urgent need for rigorous benchmarking frameworks to evaluate their predictive performance, limitations, and real-world applicability. This application note examines three key benchmarking initiatives—PertEval, PerturBench, and PEREGGRN—that provide standardized methodologies for assessing perturbation prediction capabilities in scFMs. We place special emphasis on PertEval, for which the most comprehensive benchmarking data is currently available, and discuss its implications for the field of computational biology.

PertEval-scFM: A Standardized Benchmark for Zero-Shot Prediction

PertEval-scFM is a standardized framework specifically designed to evaluate single-cell foundation models for perturbation effect prediction [17]. Its primary objective is to determine whether the contextualized representations (embeddings) learned by scFMs enhance the prediction of transcriptional changes following genetic perturbations compared to simpler baseline approaches. The benchmark operates primarily in a zero-shot setting, assessing the intrinsic capability of model embeddings without task-specific fine-tuning [17] [16].

The philosophical underpinning of PertEval is to test whether scFMs have truly learned fundamental biological principles that generalize to predicting perturbation outcomes. This approach contrasts with traditional benchmarking that might overemphasize performance on narrow tasks where models could be specifically optimized. The framework employs deliberately simple baselines to establish the minimum performance threshold that scFMs should exceed to demonstrate genuine value [11].

Experimental Protocol and Methodologies

PertEval leverages publicly available perturbation datasets that have been widely used in previous model development and validation efforts. Key datasets include:

  • Norman et al. data: Contains 100 single-gene and 124 gene-pair upregulating (CRISPR activation) perturbations in K562 cells, with phenotypes measured as log-transformed RNA-seq expression values for 19,264 genes [11].
  • Replogle et al. data: CRISPR interference datasets from K562 and RPE1 cell lines [11].
  • Adamson et al. data: Additional perturbation data from K562 cells [11].

Data preprocessing follows standardized quality control procedures, including normalization and filtering to ensure comparability across models. For the double perturbation benchmark, the dataset is partitioned with 100 single perturbations and 62 double perturbations used for training/fine-tuning, while the remaining 62 double perturbations are held out for testing [11].
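The random partitioning described above can be sketched in a few lines; `split_double_perturbations` is a hypothetical helper that mirrors the 62/62 split of double perturbations over repeated random folds.

```python
import numpy as np

def split_double_perturbations(double_ids, n_folds=5, seed=0):
    """Randomly partition double perturbations into train/test halves,
    repeated over several folds, as in the PertEval-style setup."""
    rng = np.random.default_rng(seed)
    folds = []
    for _ in range(n_folds):
        perm = rng.permutation(double_ids)
        half = len(perm) // 2
        folds.append({"train": perm[:half], "test": perm[half:]})
    return folds

# 124 hypothetical double-perturbation labels (e.g. "GENEA+GENEB").
doubles = np.array([f"pert_{i}" for i in range(124)])
folds = split_double_perturbations(doubles)
```

The 100 single perturbations are then appended to every training fold, since only double perturbations are held out.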

Model Evaluation Protocol

The evaluation workflow in PertEval involves several critical steps:

  • Embedding Extraction: Feature embeddings are extracted from multiple scFMs (including scGPT, scFoundation, Geneformer, UCE, and scBERT) for both control and perturbed cellular states [17] [11].
  • Pairwise Comparison: For each perturbation, the framework compares the embeddings of perturbed and unperturbed cells using a simple model to predict the direction and magnitude of change [16].
  • Performance Quantification: Prediction error is calculated as the L2 distance between predicted and observed expression values, typically focusing on the 1,000 most highly expressed genes [11].
  • Baseline Comparison: Model performance is compared against two simple baselines:
    • The "no change" model that always predicts the same expression as the control condition.
    • The "additive" model that predicts the sum of individual logarithmic fold changes for double perturbations [11].

The entire evaluation process is repeated across multiple random partitions of the data to ensure statistical robustness, with results aggregated across five runs [11].
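The two baselines and the L2 metric can be expressed compactly. The helper names below are illustrative; the top-k selection ranks genes by observed expression within the profile being scored, a simplification of the protocol's "most highly expressed genes" criterion.

```python
import numpy as np

def no_change_prediction(control_expr):
    """'No change' baseline: predict the control profile unchanged."""
    return control_expr

def additive_prediction(control_expr, lfc_a, lfc_b):
    """Additive baseline for a double perturbation (A,B): control plus the
    sum of the single perturbations' mean log fold changes."""
    return control_expr + lfc_a + lfc_b

def l2_error(pred, observed, top_k=1000):
    """L2 distance restricted to the top_k most highly expressed genes."""
    top = np.argsort(observed)[::-1][:top_k]
    return float(np.linalg.norm(pred[top] - observed[top]))

# Toy example with four genes and a perfectly additive double perturbation.
control = np.array([5.0, 3.0, 1.0, 0.5])
lfc_a = np.array([1.0, 0.0, -0.5, 0.0])
lfc_b = np.array([0.5, 0.2, 0.0, 0.0])
observed = control + lfc_a + lfc_b
err_add = l2_error(additive_prediction(control, lfc_a, lfc_b), observed, top_k=4)
err_none = l2_error(no_change_prediction(control), observed, top_k=4)
```

On truly additive perturbations the additive baseline is exact, which is why genetic interactions (deviations from additivity) are the informative test cases.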

PertEval evaluation workflow: Input Perturbation Datasets (Norman, Replogle, Adamson) → Extract scFM Embeddings (scGPT, scFoundation, etc.) and Generate Baseline Predictions (No-change and Additive models) → Pairwise Comparison (Perturbed vs. Control Cells) → Quantitative Evaluation (L2 Distance, Pearson Delta) → Distribution Shift Analysis → Aggregated Performance Metrics.

Key Findings and Quantitative Results

The PertEval benchmark has yielded several critical insights into the current capabilities and limitations of scFMs for perturbation prediction:

Performance Relative to Baselines

Across multiple evaluation scenarios, scFM embeddings did not consistently outperform simpler baseline models, especially under conditions of distribution shift [17]. The table below summarizes the comparative performance of various models against the established baselines:

Table 1: Performance comparison of scFMs against simple baselines on double perturbation prediction tasks

| Model | Prediction Error (L2 Distance) | Performance vs. Additive Baseline | Genetic Interaction Prediction Accuracy |
| --- | --- | --- | --- |
| Additive Baseline | Reference | - | Not Applicable |
| No Change Baseline | Higher than additive | Worse | Poor (limited to buffering interactions) |
| scGPT | Higher than additive | Worse | Mostly predicts buffering interactions |
| Geneformer | Higher than additive | Worse | Rarely predicts synergistic interactions |
| scFoundation | Higher than additive | Worse | Limited to specific gene subsets |
| GEARS | Higher than additive | Worse | Mostly buffering, rarely correct synergistic |
| UCE | Higher than additive | Worse | Similar to no-change baseline |
| scBERT | Higher than additive | Worse | Similar to no-change baseline |
Specific Limitations Identified

The benchmarking revealed several consistent limitations across current-generation scFMs:

  • Poor Performance on Strong/Atypical Perturbations: All models struggled with predicting strong or atypical perturbation effects, suggesting limited generalization to extreme transcriptional changes [17].
  • Inadequate Genetic Interaction Prediction: For genetic interactions (where double perturbation effects deviate from additive expectations), most models performed no better than the "no change" baseline and predominantly predicted buffering interactions while rarely correctly identifying synergistic effects [11].
  • Limited Extrapolation to Unseen Perturbations: In benchmarks assessing prediction of unseen perturbations, models consistently failed to outperform simple linear models or even the mean prediction baseline [11].
  • Embedding Utility: When gene embeddings from scFMs were used in simple linear models rather than their native architectures, they performed as well or better than the original models, suggesting the problem lies in how embeddings are utilized rather than their intrinsic quality [11].

Complementary Benchmarking Frameworks

PerturBench: Assessing Generalization Across Cellular Contexts

Although published benchmarking results for PerturBench are less extensive than for PertEval, this framework is designed to evaluate how well perturbation prediction models generalize across diverse cellular contexts and experimental conditions. It typically incorporates datasets from multiple cell types and perturbation modalities to assess cross-context transfer learning capabilities.

PEREGGRN: Focusing on Gene Regulatory Network Inference

PEREGGRN specializes in benchmarking models for gene regulatory network inference from single-cell data. This framework addresses the distinct challenge of reconstructing directed regulatory relationships between genes, particularly transcription factors and their targets [76]. Benchmarking in this domain requires specialized ground truth networks derived from experimental data such as ChIP-seq, CRISPR perturbations, and carefully curated databases like RegulonDB [76].

Research Reagent Solutions: Essential Tools for Perturbation Modeling

The following table details key computational tools and data resources essential for conducting rigorous benchmarking of perturbation prediction models:

Table 2: Essential research reagents and computational tools for perturbation modeling benchmarking

| Resource Type | Specific Examples | Function/Application |
| --- | --- | --- |
| Single-Cell Foundation Models | scGPT, scFoundation, Geneformer, UCE, scBERT [11] [12] | Generate contextualized embeddings of single-cell states for prediction tasks |
| Benchmarking Frameworks | PertEval-scFM, BEELINE [17] [77] | Standardized evaluation pipelines and performance metrics |
| Ground Truth Datasets | Norman et al. (CRISPRa), Replogle et al. (CRISPRi), Adamson et al. [11] | Experimentally validated perturbation data for training and testing |
| Baseline Models | Additive model, No-change model, Linear models [11] | Simple reference points for establishing minimum performance thresholds |
| Gene Regulatory Networks | STRING, RegulonDB, Cell-type-specific ChIP-seq [76] [77] | Curated molecular interaction data for validation of regulatory predictions |
| Specialized Architectures | 1DCNN-GRU hybrids, Graph Neural Networks, Transformers [78] [12] [77] | Advanced model architectures for capturing spatial and temporal dependencies |

Integrated Experimental Protocol for Comprehensive Benchmarking

Based on the methodologies employed across these benchmarking initiatives, we propose the following integrated protocol for rigorous evaluation of perturbation prediction models:

Data Acquisition and Curation

  • Dataset Collection: Obtain diverse perturbation datasets spanning multiple cell types (e.g., K562, RPE1), perturbation technologies (CRISPRa, CRISPRi), and perturbation types (single gene, double gene).
  • Quality Control: Apply consistent filtering to remove low-quality cells and genes, using thresholds such as excluding genes expressed in fewer than 10% of cells [77].
  • Normalization: Implement appropriate normalization methods (e.g., logarithmic transformation) to address technical variability [77].
  • Data Partitioning: Split data into training, validation, and test sets, ensuring that specific perturbations are held out for testing generalization to unseen conditions.

Model Configuration and Training

  • Embedding Extraction: For foundation models, extract cell and gene embeddings using the native architectures without additional fine-tuning for zero-shot evaluation [17].
  • Baseline Implementation: Implement simple baseline models including the "no change" predictor and additive model for double perturbations [11].
  • Architecture-Specific Setup: For specialized models like the 1DCNN-GRU hybrid, configure architecture parameters as described in the original publications [78].
  • Training Regimen: For models requiring training, use standardized procedures with early stopping based on validation performance to prevent overfitting [77].

Evaluation and Analysis

  • Metric Calculation: Compute multiple performance metrics including L2 distance, Pearson correlation, and precision-recall for specific interaction types [11].
  • Error Analysis: Investigate specific failure modes, such as performance on strong perturbations or atypical transcriptional responses [17].
  • Comparative Statistics: Perform statistical testing to determine significant differences between model performances across multiple random seeds [11].

Integrated benchmarking workflow: Data Acquisition & Curation (Dataset Collection → Quality Control & Filtering → Normalization & Processing → Stratified Data Partitioning) → Model Configuration & Training (Embedding Extraction, zero-shot or fine-tuned → Baseline Model Implementation → Architecture-Specific Setup → Training with Early Stopping) → Evaluation & Analysis (Multi-Metric Performance Calculation → Error Analysis & Failure Mode Investigation → Statistical Comparison & Significance Testing → Benchmark Reporting & Visualization).

The current benchmarking landscape for in silico perturbation modeling reveals significant gaps between the promised capabilities of single-cell foundation models and their actual performance on predictive tasks. The consistent finding that simple baselines remain competitive with—and often outperform—sophisticated scFMs underscores the immaturity of this field and highlights the need for more biologically-grounded architectures and training approaches [17] [11].

Future development should focus on several key areas: (1) creating higher-quality datasets that capture a broader range of cellular states and perturbation strengths [17], (2) developing specialized model architectures that explicitly incorporate biological knowledge about gene regulatory networks [77], and (3) establishing more nuanced benchmarking frameworks that test specific biological capabilities beyond aggregate performance metrics. As these improvements materialize, rigorous benchmarking through initiatives like PertEval will remain essential for guiding progress toward truly predictive in silico models of cellular behavior.

The application of single-cell foundation models (scFMs) to predict gene expression changes following genetic perturbations represents a frontier in computational biology, with significant implications for drug development and basic research. These models, pre-trained on millions of single-cell transcriptomes, promise to serve as "virtual cells" for in silico experimentation, potentially reducing the need for costly and labor-intensive laboratory screens [9] [1]. However, a growing body of rigorous, comparative benchmarking studies reveals a striking consensus: sophisticated scFMs frequently fail to outperform deliberately simple baseline models, such as linear predictors and mean expression models, in predicting perturbation effects [11] [17]. This application note synthesizes critical findings from recent benchmarks, providing researchers with structured data and validated protocols to navigate this rapidly evolving field.

The performance gap between complex scFMs and simple baselines is consistent across diverse experimental contexts. A landmark study published in Nature Methods directly compared five foundation models and two other deep learning models against simple baselines for predicting transcriptome changes after single or double genetic perturbations. The study concluded that "none outperformed the baselines," highlighting a significant challenge for the field [11]. Similarly, the PEREGGRN benchmarking platform, which evaluates methods across 11 large-scale perturbation datasets, found that "it is uncommon for expression forecasting methods to outperform simple baselines" [79]. These findings underscore the importance of critical benchmarking in directing and evaluating methodological development, especially as scFMs are increasingly applied to prioritize therapeutic targets for conditions like RUNX1-familial platelet disorder and T-cell activation [9].

Quantitative Performance Comparison

Comprehensive benchmarking across multiple datasets and experimental setups provides a clear, quantitative picture of the current capabilities and limitations of scFMs for perturbation prediction. The table below summarizes key performance metrics from major benchmarking studies, comparing scFMs against simple baseline models.

Table 1: Performance Summary of scFMs vs. Simple Baselines in Perturbation Prediction

| Benchmark Task | Top-Performing scFM | Best Simple Baseline | Performance Comparison | Key Metric | Dataset(s) |
| --- | --- | --- | --- | --- | --- |
| Double Perturbation Prediction | scGPT | Additive Model (Sum of LFCs) | scFM error substantially higher [11] | L2 Distance (Top 1k genes) | Norman et al. [11] |
| Unseen Single Perturbation Prediction | Geneformer (with linear decoder) | Linear Model with Pretrained P | No consistent improvement over baseline [11] | L2 Distance | Replogle et al. (K562, RPE1) [11] |
| Genetic Interaction Identification | Various scFMs | No Change Baseline | No model better than baseline [11] | True-Positive Rate vs. FDP | Norman et al. [11] |
| T-cell Activation Prediction (Open-loop) | Geneformer | Differential Expression (DE) | Superior NPV (98% vs 78%) and specificity (60% vs 50%), but same low PPV (3%) [9] | Predictive Values | Orthogonal Flow Cytometry [9] |
| T-cell Activation Prediction (Closed-loop) | Fine-tuned Geneformer | Open-loop ISP | 3x increase in PPV (3% to 9%) with improved sensitivity/specificity [9] | Positive Predictive Value | Perturb-seq in Primary T-cells [9] |
| Zero-shot Perturbation Effect Prediction | Multiple scFMs (Geneformer, scGPT) | Simple Baseline Models | scFM embeddings provided no consistent improvement, especially under distribution shift [17] | Multiple Metrics | PertEval-scFM Benchmark [17] |

A critical insight from these benchmarks is that even models explicitly designed for perturbation prediction, such as GEARS, scGPT, and scFoundation, struggle to surpass the predictive accuracy of simple models. The "additive model," which predicts double perturbation effects by summing the logarithmic fold changes of individual perturbations, consistently outperformed deep learning models. Similarly, a simple linear model or even predicting the mean expression across training perturbations often proved more effective and computationally efficient than fine-tuning large foundation models [11]. Furthermore, a specialized "closed-loop" fine-tuning approach, which incorporates experimental Perturb-seq data into the model training cycle, demonstrated that scFM performance can be significantly improved. This method achieved a three-fold increase in positive predictive value for T-cell activation, suggesting a viable path for enhancing scFM utility [9].
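The simple linear predictor mentioned above can be written as Y ≈ G × W × Pᵀ, where G holds gene embeddings, P holds perturbation representations, and only the interaction matrix W is learned. The sketch below uses synthetic matrices to show the closed-form least-squares fit for W; it illustrates the idea and is not the exact implementation from [11].

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, g_dim, p_dim, n_perts = 200, 8, 6, 50

G = rng.normal(size=(n_genes, g_dim))   # gene embeddings (e.g. from an scFM)
P = rng.normal(size=(n_perts, p_dim))   # perturbation representations
W_true = rng.normal(size=(g_dim, p_dim))
Y = G @ W_true @ P.T                    # expression responses, genes x perts

# Least squares for min_W ||G W P^T - Y||_F^2 has the closed form
#   W = (G^T G)^{-1} G^T Y P (P^T P)^{-1}
W_hat = np.linalg.solve(G.T @ G, G.T @ Y @ P) @ np.linalg.inv(P.T @ P)
Y_hat = G @ W_hat @ P.T
```

A model this small trains in milliseconds, which is part of why it is such an awkward baseline for large foundation models to lose to.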

Detailed Experimental Protocols

To ensure reproducibility and facilitate independent validation of these findings, this section outlines detailed protocols for the key benchmarking experiments cited.

Protocol 1: Benchmarking Double Perturbation Predictions

This protocol is adapted from the benchmark performed on the dataset from Norman et al., as detailed in [11].

I. Experimental Preparation and Reagents

  • Perturbation Dataset: Utilize the single-cell CRISPRa dataset from Norman et al., which includes 100 single-gene and 124 double-gene perturbations in K562 cells.
  • Software Environment: Configure a Python environment with libraries for deep learning (PyTorch/TensorFlow) and single-cell analysis (Scanpy). Specific versions of model code (e.g., scGPT, GEARS, scFoundation) are required for exact replication.
  • Computational Resources: Ensure access to high-performance computing resources with multiple GPUs (e.g., NVIDIA A100 or V100) for feasible model fine-tuning.

II. Data Preprocessing

  • Data Splitting: Partition the 124 double perturbations into five random folds, using 62 for training and 62 for testing in each fold. The 100 single perturbations are included in all training sets.
  • Gene Filtering: For the primary evaluation metric, filter the gene expression matrix to include the 1,000 most highly expressed genes across all cells.
  • Expression Normalization: Apply standard scRNA-seq normalization, including total count normalization and logarithmic transformation (e.g., log1p).

III. Model Training and Baselines

  • Fine-tune scFMs: Fine-tune each foundation model (e.g., scGPT, Geneformer) on the combined training set of single and double perturbations according to their respective published protocols.
  • Implement Baselines:
    • No Change Model: For any perturbation, predict the control condition's expression profile.
    • Additive Model: For a double perturbation (A,B), predict the expression as control_expression + LFC_A + LFC_B, where LFC is the mean logarithmic fold change from the single perturbations in the training data.

IV. Performance Evaluation

  • Calculate L2 Distance: For each held-out double perturbation, compute the L2 distance between the predicted and observed expression values for the top 1,000 genes. Aggregate results across all five test folds.
  • Evaluate Genetic Interactions: Identify genetic interactions in the ground truth data where the double perturbation phenotype significantly deviates from the additive expectation. Plot the true-positive rate against the false discovery proportion for interactions predicted by each model.

Protocol 2: Closed-loop Fine-tuning for Enhanced Prediction

This protocol is based on the methodology described in [9] for improving T-cell activation predictions.

I. Experimental Preparation

  • Base Model: Obtain the pre-trained Geneformer-30M-12L model.
  • Datasets:
    • Activation Status Data: Collate scRNA-seq data from CD3-CD28 stimulated and resting T-cells from multiple studies.
    • Perturbation Data: Obtain Perturb-seq data from primary human T-cells involving CRISPRa/i screens of 75 genes. The data should be labeled with cellular activation status but not the specific gene perturbed.

II. Initial Fine-tuning (Open-loop)

  • Task Formulation: Fine-tune Geneformer on the activation status data to classify cells as "resting" or "activated."
  • Validation: Confirm the model achieves high accuracy (>99%) on a held-out test set of cells.

III. Closed-loop Fine-tuning

  • Data Integration: Combine the original activation status dataset with the Perturb-seq dataset.
  • Model Fine-tuning: Continue fine-tuning the model from step II on this combined dataset. The learning objective remains the classification of activation status, but the model is now exposed to the effects of genetic perturbations within this context.
  • Performance Saturation Testing: To determine the minimal number of perturbation examples needed, iteratively fine-tune the model with random subsets of the perturbation data (e.g., 10, 20, 30 examples) and evaluate performance metrics.

IV. In Silico Perturbation (ISP) and Evaluation

  • Run ISP: Use the fine-tuned model to perform in silico knockout and overexpression of all genes not included in the Perturb-seq data.
  • Validate Predictions: Compare ISP predictions against orthogonal flow cytometry data measuring IL-2 and IFN-γ production. Calculate Positive Predictive Value (PPV), Negative Predictive Value (NPV), sensitivity, and specificity.
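The four reported metrics derive directly from a confusion matrix. The sketch below uses illustrative counts only (not the study's actual data), chosen so that PPV lands near the 9% closed-loop figure.

```python
def prediction_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics used to score ISP hit calls against
    orthogonal ground truth (e.g. flow cytometry)."""
    return {
        "PPV": tp / (tp + fp),          # positive predictive value
        "NPV": tn / (tn + fn),          # negative predictive value
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }

# Hypothetical counts: 9 of 100 ISP-predicted hits validate (PPV = 9%).
m = prediction_metrics(tp=9, fp=91, tn=810, fn=3)
```

Note that with rare true hits, PPV stays low even when NPV, sensitivity, and specificity are all high, which is exactly the pattern the T-cell study reports.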

Diagram 1: The closed-loop fine-tuning workflow enhances scFM performance by integrating experimental perturbation data.

Pre-trained scFM → Fine-tune on Activation Status → Open-loop ISP → Evaluate vs. Flow Cytometry (low PPV) → Integrate Perturb-seq Data → Closed-loop Fine-tuning → Closed-loop ISP → Evaluate (PPV increases 3x).

Visualization of Key Concepts and Workflows

To clarify the logical relationships and structural differences between model types and benchmarking outcomes, the following diagrams are provided.

Diagram 2: Simplified architecture comparison of an scFM versus a simple linear baseline for perturbation prediction.

Single-cell foundation model (e.g., scGPT): Perturbation Condition → model requiring fine-tuning, heavy computation, and millions of parameters → Predicted Expression Vector. Simple linear model: Perturbation Vector (P) and Gene Embedding Matrix (G) → fast-training, transparent model with few parameters → Predicted Expression (G × W × Pᵀ).

Diagram 3: Decision workflow for choosing a perturbation prediction strategy, based on benchmarking results.

Starting from the need to predict perturbation effects:

  • Is a large, relevant perturbation dataset available for training? If no, use a simple linear model or mean predictor.
  • If yes: Are you predicting double perturbations? If yes, use the additive model (sum of LFCs).
  • If no: Is high interpretability or computational efficiency a priority? If yes, use a simple baseline.
  • If no: Can you generate a small Perturb-seq dataset for fine-tuning? If yes, use an scFM with closed-loop fine-tuning; if no, consider an scFM but test it against simple baselines.

The Scientist's Toolkit: Essential Research Reagents and Materials

Successful execution of the protocols and interpretation of benchmarking studies require familiarity with key computational tools and data resources. The following table catalogs essential components of the in silico perturbation modeling workflow.

Table 2: Key Research Reagents and Computational Tools for scFM Perturbation Research

| Item Name | Type | Primary Function in Research | Example/Source |
| --- | --- | --- | --- |
| Perturbation Datasets | Biological Data | Provides ground truth data for training and benchmarking models. | Norman et al. (CRISPRa), Replogle et al. (CRISPRi), Adamson et al. (UPR genes) [79] [11] |
| Single-cell Foundation Models (scFMs) | Pre-trained Model | Encodes prior biological knowledge from vast scRNA-seq atlases; base for fine-tuning. | Geneformer, scGPT, scFoundation, UCE [11] [1] |
| Benchmarking Platforms | Software Framework | Standardizes evaluation of different models across datasets and tasks. | PEREGGRN [79], PertEval-scFM [17] |
| Simple Baseline Models | Algorithm | Provides a critical performance baseline (e.g., additive, linear, mean predictor). | Additive Model, Linear Model (Y ≈ G x W x Pᵀ), Mean Predictor [11] |
| Gene Embeddings | Data Representation | Vector representations of genes learned by models; can be used in linear predictors. | Extracted from scFoundation or scGPT [11] |
| Perturbation Embeddings | Data Representation | Vector representations of perturbation effects; can be pre-trained on related data. | Extracted from GEARS or learned from data [11] |

In the field of in silico perturbation modeling with single-cell foundation models (scFMs), the accurate assessment of model performance is paramount. Predicting transcriptional responses to genetic perturbations represents a core challenge in functional genomics, with significant implications for revealing gene functions, mapping regulatory networks, and accelerating therapeutic discovery [80]. As the space of possible perturbations is combinatorially complex, computational approaches have been developed to predict transcriptional outcomes of genetic perturbations that were never experimentally tested. The evaluation of these models relies heavily on specific performance metrics that quantify how well predictions match experimental observations.

Recent benchmarking studies have revealed surprising insights about metric performance and interpretation. Simple baseline models—including those that predict the average expression across all perturbed cells (perturbed mean) or the average of matched post-perturbation profiles for combinatorial perturbations (matching mean)—often perform comparably to or even outperform state-of-the-art foundation models like scGPT and scFoundation across multiple datasets [80] [81]. This phenomenon has been largely attributed to systematic variation in perturbation datasets, which represents consistent transcriptional differences between perturbed and control cells arising from selection biases or biological confounders [80]. These findings underscore the critical importance of selecting metrics that can distinguish true perturbation-specific effects from systematic biases.
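A perturbed-mean baseline of the kind described is a one-liner. The toy data below (entirely synthetic) builds in a shared systematic shift across perturbations to show why such a baseline can look strong when perturbed cells share a confounder with held-out perturbations.

```python
import numpy as np

def perturbed_mean_baseline(perturbed_profiles):
    """Predict every held-out perturbation as the mean expression
    across all perturbed training cells."""
    return perturbed_profiles.mean(axis=0)

rng = np.random.default_rng(3)
control = np.zeros(4)
# Systematic shift shared by all perturbed cells (selection bias or
# biological confounder), plus small perturbation-specific noise.
systematic_shift = np.array([1.0, 1.0, 0.0, 0.0])
train = systematic_shift + rng.normal(0, 0.05, size=(6, 4))
held_out = systematic_shift + rng.normal(0, 0.05, size=4)

pred = perturbed_mean_baseline(train)
err_mean = float(np.linalg.norm(pred - held_out))
err_ctrl = float(np.linalg.norm(control - held_out))
```

The perturbed-mean prediction beats the control profile here purely by capturing the shared shift, without modeling any perturbation-specific effect; this is the systematic-variation artifact the text describes.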

This protocol explores the theoretical foundations, practical applications, and limitations of the primary metrics used for evaluating perturbation prediction models, with particular emphasis on their implementation within scFM research.

Theoretical Foundations of Key Metrics

Root Mean Square Error (RMSE)

RMSE is defined as the square root of the average squared difference between predicted and observed values. For a sample of n observations y_i and corresponding model predictions ŷ_i, the RMSE is calculated as:

RMSE = √( (1/n) Σᵢ (y_i − ŷ_i)² )

This metric represents the standard deviation of the prediction errors (residuals), providing a measure of how concentrated the data is around the line of best fit [82]. RMSE is expressed in the same units as the predicted variable, facilitating intuitive interpretation. The theoretical justification for RMSE stems from maximum likelihood estimation, where it is optimal for normally distributed (Gaussian) errors [83]. In perturbation modeling, it penalizes large errors more heavily than small errors due to the squaring of each term, making it particularly sensitive to outliers.
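A direct NumPy implementation makes the outlier sensitivity explicit; the example values are arbitrary.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error of predicted vs. observed values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# A single large residual dominates the score because errors are squared.
base = rmse([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 3.1, 4.1])    # all errors 0.1
outlier = rmse([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 3.1, 8.0])  # one error of 4.0
```

Changing one of four predictions inflates the RMSE roughly twenty-fold, which is the behavior motivating the metric-selection guidance below.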

Rank Correlation Coefficients

Spearman's rank correlation coefficient (Spearman's ρ) measures how well the relationship between two variables can be described by a monotonic function, whether linear or not [84]. It assesses how similar the ranks of observations are between the two variables. The coefficient is calculated as:

ρ = cov(rgX, rgY) / (σ_rgX · σ_rgY)

where rgX and rgY are the rank variables of the predicted and ground truth values, cov denotes covariance, and σ represents standard deviation [85]. For data without tied ranks, a simplified formula exists:

ρ = 1 − (6 Σ dᵢ²) / (n(n² − 1))

where d_i is the difference between the two ranks of each observation and n is the sample size [84]. This nonparametric measure is appropriate for both continuous and discrete ordinal variables, making it valuable for assessing whether a model correctly captures the relative ordering of gene expression changes following perturbations, which is often more biologically meaningful than exact numerical predictions.
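
The simplified no-ties formula can be checked directly against SciPy's implementation (a small sketch with made-up differential-expression values; spearman_no_ties is an illustrative helper, not a library function):

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_no_ties(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Simplified Spearman formula, valid when there are no tied ranks."""
    n = len(y_true)
    # argsort of argsort yields 0-based ranks
    rank_true = np.argsort(np.argsort(y_true))
    rank_pred = np.argsort(np.argsort(y_pred))
    d = rank_true - rank_pred
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Toy differential-expression values for six genes (illustrative only)
delta_true = np.array([2.1, -0.5, 0.3, 1.7, -1.2, 0.0])
delta_pred = np.array([1.5, -0.4, 0.6, 2.0, -0.9, 0.1])

rho_manual = spearman_no_ties(delta_true, delta_pred)
rho_scipy, _ = spearmanr(delta_true, delta_pred)
print(rho_manual, rho_scipy)  # both ≈ 0.9429
```

Note that the high ρ here reflects agreement in the ordering of gene-level effects even though the predicted magnitudes differ from the ground truth.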

Metric Selection Framework

The choice between RMSE and rank-based metrics should be informed by the error distribution characteristics and research objectives:

Table 1: Metric Selection Guide Based on Error Distribution and Research Goals

| Error Distribution | Optimal Metric | Theoretical Justification | Perturbation Modeling Context |
| --- | --- | --- | --- |
| Normal (Gaussian) | RMSE | Maximum likelihood estimator for normal errors [83] | Appropriate for technical replicates with well-controlled experimental conditions |
| Laplace (heavy-tailed) | MAE | Maximum likelihood estimator for Laplacian errors [83] | Better for data with occasional large prediction errors or outliers |
| Unknown or complex | Spearman's ρ | Nonparametric; assesses monotonic relationships without distributional assumptions [84] [85] | Preferred for evaluating ranking of differential expression effects |

Benchmarking Insights and Metric Limitations

Systematic Variation in Perturbation Datasets

A critical challenge in perturbation modeling evaluation is the presence of systematic variation—consistent transcriptional differences between perturbed and control cells that arise from selection biases in the perturbation panel or underlying biological confounders [80]. This variation can profoundly impact metric interpretation:

  • Dataset-specific biases: In the Adamson dataset, which targets genes involved in endoplasmic reticulum homeostasis, systematic differences appear in pathways related to response to external stimuli and chemical stress [80]. Similarly, the Norman dataset, focusing on cell cycle and growth genes, shows systematic variation in cell death activation and stress response downregulation [80].
  • Biological confounders: In the Replogle RPE1 dataset, significant differences in cell-cycle distribution exist between perturbed and control cells (46% of perturbed cells vs. 25% of control cells in G1 phase), attributed to widespread chromosomal instability induced by perturbations [80].
  • Metric susceptibility: Standard reference-based metrics are highly susceptible to these systematic differences, potentially leading to overoptimistic performance assessments when models primarily capture average perturbation effects rather than perturbation-specific biology [80].

Comparative Performance of Metrics in scFM Evaluation

Recent benchmarking studies have revealed substantial discrepancies in model rankings depending on the chosen evaluation metric:

Table 2: Comparative Performance of Models and Baselines Across Metrics and Datasets

| Dataset | Model | PearsonΔ | PearsonΔ20 | RMSE | Spearman's ρ |
| --- | --- | --- | --- | --- | --- |
| Adamson | Train Mean | 0.711 | - | - | - |
| Adamson | scGPT | 0.641 | - | - | - |
| Adamson | RF with GO features | 0.739 | - | - | - |
| Norman | Train Mean | 0.557 | - | - | - |
| Norman | scGPT | 0.554 | - | - | - |
| Norman | RF with GO features | 0.586 | - | - | - |
| Replogle K562 | Train Mean | 0.373 | - | - | - |
| Replogle K562 | scGPT | 0.327 | - | - | - |
| Replogle K562 | RF with GO features | 0.480 | - | - | - |
| Generic evaluation | CPA | Variable | Variable | Variable | - |
| Generic evaluation | GEARS | Variable | Variable | Variable | - |

Data adapted from benchmark studies [80] [81]. PearsonΔ represents correlation in differential expression space.

Unexpectedly, the simple Train Mean baseline consistently matches or exceeds the performance of sophisticated foundation models like scGPT and scFoundation across multiple datasets when evaluated using Pearson correlation in differential expression space (PearsonΔ) [81]. Similarly, in predicting combinatorial perturbation responses, the matching mean baseline outperformed all other methods by considerable margins (11% improvement for PearsonΔ over the best alternative method) [80].

The Systema Framework for Bias-Aware Evaluation

To address metric limitations, the Systema framework has been developed specifically for evaluating genetic perturbation response prediction beyond systematic variation [80]. This framework introduces two key advances:

  • Mitigation of systematic biases by focusing on perturbation-specific effects rather than overall expression changes
  • Interpretable readout of methods' ability to reconstruct the perturbation landscape, differentiating predictions that merely replicate systematic effects from those capturing biologically informative perturbation responses

The framework implementation is available on GitHub (https://github.com/mlbio-epfl/systema) and provides more biologically meaningful assessment of perturbation response modeling [80].

Experimental Protocols for Metric Implementation

Standard Evaluation Workflow for scFM Perturbation Prediction

The following workflow diagram illustrates the comprehensive evaluation process for perturbation prediction models:

Start Evaluation → Data Preparation (split into training/test sets; ensure unseen perturbations in the test set) → Model Training (train on training perturbations; include simple baselines such as Train Mean) → Generate Predictions (predict post-perturbation expression profiles) → Create Pseudo-bulk Profiles (average single-cell predictions per perturbation) → Calculate Differential Expression (perturbed vs. control cells) → Metric Calculation (compute RMSE, PearsonΔ, Spearman's ρ, etc.) → Systematic Variation Check (apply the Systema framework; GSEA for pathway enrichment) → Results Interpretation (compare against baselines; assess biological relevance) → Evaluation Complete

Diagram Title: scFM Perturbation Evaluation Workflow

Protocol: Comprehensive Metric Calculation

Data Preparation and Preprocessing
  • Dataset Selection and Partitioning

    • Select perturbation datasets with appropriate scale and experimental design (e.g., Adamson, Norman, or Replogle datasets)
    • Partition data using Perturbation Exclusive (PEX) splitting, ensuring perturbations in the test set are completely unseen during training [81]
    • Maintain separate control cell profiles for differential expression calculation
  • Baseline Model Implementation

    • Implement Train Mean baseline: calculate average pseudo-bulk expression profiles across all training perturbations
    • Implement Matching Mean baseline for combinatorial perturbations: average the centroids of matching single-gene perturbations [80]
    • Ensure baselines use identical data preprocessing as evaluated models
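
The two baselines above reduce to simple averages, assuming pseudo-bulk profiles are stored per perturbation (a minimal sketch; the gene names and values are illustrative, not tied to any dataset):

```python
import numpy as np

def train_mean_baseline(train_profiles: dict) -> np.ndarray:
    """Train Mean baseline: average pseudo-bulk profile across all training perturbations."""
    return np.mean(np.stack(list(train_profiles.values())), axis=0)

def matching_mean_baseline(train_profiles: dict, gene_a: str, gene_b: str) -> np.ndarray:
    """Matching Mean baseline for a combinatorial perturbation A+B:
    average the centroids of the matching single-gene perturbations."""
    return (train_profiles[gene_a] + train_profiles[gene_b]) / 2.0

# Toy pseudo-bulk profiles over three genes (illustrative values only)
train = {
    "KLF1": np.array([1.0, 0.0, 2.0]),
    "GATA1": np.array([3.0, 1.0, 0.0]),
    "TP53": np.array([2.0, 2.0, 1.0]),
}

print(train_mean_baseline(train))                      # [2. 1. 1.]
print(matching_mean_baseline(train, "KLF1", "GATA1"))  # [2.  0.5 1. ]
```

Keeping these baselines in the same preprocessing pipeline as the evaluated models is what makes the comparison fair.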
Prediction Generation and Processing
  • Model Inference

    • Generate predicted expression profiles for all test set perturbations
    • Process both single-gene and combinatorial perturbations if applicable
    • Maintain single-cell resolution in initial predictions
  • Pseudo-bulk Creation

    • Average predicted expression profiles across cells for each perturbation to create pseudo-bulk profiles
    • Apply identical averaging to ground truth data
    • This step reduces noise and enables robust correlation calculations [81]
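
Pseudo-bulk creation amounts to a group-by mean over cells; a minimal NumPy sketch (variable names and values are illustrative; with AnnData, expr would correspond to adata.X and labels to a column of adata.obs):

```python
import numpy as np

def pseudo_bulk(expr: np.ndarray, labels: np.ndarray) -> dict:
    """Average a cells x genes expression matrix within each perturbation label."""
    return {lab: expr[labels == lab].mean(axis=0) for lab in np.unique(labels)}

# Toy matrix: 4 cells x 3 genes, two perturbations (illustrative values only)
expr = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [0.0, 1.0, 1.0],
    [2.0, 3.0, 3.0],
])
labels = np.array(["pertA", "pertA", "pertB", "pertB"])

profiles = pseudo_bulk(expr, labels)
print(profiles["pertA"])  # [2. 1. 1.]
```

The identical averaging is applied to the ground-truth cells so that predictions and observations are compared at the same resolution.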
Differential Expression Calculation
  • Reference-based Differential Expression

    • Calculate Δ(predicted) = mean(predicted perturbed) - mean(control)
    • Calculate Δ(ground truth) = mean(actual perturbed) - mean(control)
    • Use identical control cells for both calculations
  • Gene Selection for Evaluation

    • For comprehensive assessment: use all genes
    • For focused assessment: select top 20 differentially expressed genes based on ground truth [80] [81]
    • Consider pathway-specific gene panels for biologically targeted evaluation
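
The Δ computation and top-k gene selection described above reduce to simple vector operations (a sketch with illustrative values; the protocol uses k = 20, a smaller k is shown here):

```python
import numpy as np

def delta(perturbed_mean: np.ndarray, control_mean: np.ndarray) -> np.ndarray:
    """Reference-based differential expression: mean(perturbed) - mean(control)."""
    return perturbed_mean - control_mean

def top_k_de_genes(delta_true: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k genes with the largest absolute ground-truth change."""
    return np.argsort(-np.abs(delta_true))[:k]

# Toy pseudo-bulk means over five genes (illustrative values only)
control = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
perturbed = np.array([3.0, 1.1, 0.2, 1.0, 2.5])

d_true = delta(perturbed, control)   # [ 2.   0.1 -0.8  0.   1.5]
idx = top_k_de_genes(d_true, k=3)
print(idx)  # [0 4 2], ranked by |Δ(ground truth)|
```

Using the same control cells for both Δ(predicted) and Δ(ground truth) keeps any control-side bias from entering the comparison asymmetrically.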
Metric Computation Procedure
  • RMSE Calculation

    • Compute RMSE between Δ(ground truth) and Δ(predicted) for each test perturbation
    • Summarize as the mean RMSE across test-set perturbations
  • Spearman's Rank Correlation Calculation

    • Rank genes by predicted and by observed differential expression within each perturbation
    • Compute Spearman's ρ between the two rankings to assess how well the ordering of effects is captured
  • Pearson Correlation in Differential Expression Space

    • Calculate PearsonΔ = Pearson correlation between Δ(ground truth) and Δ(predicted) across all genes
    • Calculate PearsonΔ20 = Same calculation restricted to top 20 differentially expressed genes [80]
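
PearsonΔ and its top-20 variant can be computed with a single helper (an illustrative sketch; pearson_delta is not a library function, and the Δ vectors below are made up):

```python
import numpy as np

def pearson_delta(d_true, d_pred, top_k=None):
    """PearsonΔ: correlation between ground-truth and predicted differential expression.
    If top_k is given, restrict to the top-k genes by |ground-truth change|
    (PearsonΔ20 corresponds to top_k=20)."""
    if top_k is not None:
        idx = np.argsort(-np.abs(d_true))[:top_k]
        d_true, d_pred = d_true[idx], d_pred[idx]
    return float(np.corrcoef(d_true, d_pred)[0, 1])

# Toy Δ vectors over five genes (illustrative values only)
d_true = np.array([2.0, 0.1, -0.8, 0.0, 1.5])
d_pred = np.array([1.6, 0.0, -0.5, 0.2, 1.2])

print(pearson_delta(d_true, d_pred))           # correlation over all genes
print(pearson_delta(d_true, d_pred, top_k=3))  # restricted to top-3 DE genes
```

Restricting to top differentially expressed genes focuses the metric on the perturbation-specific signal rather than the many near-zero genes.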
Systematic Variation Assessment
  • Pathway Enrichment Analysis

    • Perform Gene Set Enrichment Analysis (GSEA) between control and perturbed cells
    • Use AUCell to score pathway activity in single cells [80]
    • Identify systematically enriched pathways that might confound metric interpretation
  • Cell State Distribution Analysis

    • Compare distribution of cells across cell-cycle phases between perturbed and control populations
    • Calculate Jensen-Shannon divergence and chi-squared test for significance [80]
    • Assess other potential confounders (e.g., stress response signatures)
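
Both statistics are available in SciPy; a sketch comparing cell-cycle phase distributions between perturbed and control populations (the counts are illustrative, loosely echoing the G1 enrichment reported for the Replogle RPE1 data; note that SciPy's jensenshannon returns the JS distance, whose square is the divergence):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import chi2_contingency

# Cell counts per cell-cycle phase (G1, S, G2/M) — illustrative numbers only
perturbed = np.array([460, 300, 240])
control = np.array([250, 400, 350])

# Jensen-Shannon: scipy returns the JS *distance*; square it for the divergence
p = perturbed / perturbed.sum()
q = control / control.sum()
js_divergence = jensenshannon(p, q, base=2) ** 2

# Chi-squared test of independence on the 2 x 3 contingency table
chi2, pval, dof, _ = chi2_contingency(np.stack([perturbed, control]))

print(js_divergence, pval)
```

A large divergence with a significant p-value flags a confounded comparison in which reference-based metrics may partly reward recovery of the confounder rather than the perturbation effect.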

Protocol: Advanced Evaluation with Systema Framework

  • Installation and Setup

    • Clone the Systema repository: git clone https://github.com/mlbio-epfl/systema
    • Follow installation instructions in documentation
    • Prepare data in compatible format (AnnData)
  • Perturbation-Specific Effect Isolation

    • Apply Systema's bias correction algorithms to isolate perturbation-specific effects
    • Quantify the degree of systematic variation in the dataset
    • Recalculate metrics on corrected effects
  • Perturbation Landscape Reconstruction Assessment

    • Evaluate method's ability to reconstruct the overall perturbation landscape
    • Assess whether perturbations targeting functionally coherent gene groups cluster appropriately in predicted space
    • Calculate additional landscape-based metrics provided by the framework

Table 3: Key Research Reagents and Computational Tools for Perturbation Metric Evaluation

| Resource Name | Type | Function in Evaluation | Implementation Notes |
| --- | --- | --- | --- |
| Systema [80] | Evaluation Framework | Isolates perturbation-specific effects from systematic variation | GitHub: mlbio-epfl/systema; requires AnnData format |
| Perturbation Datasets [80] [81] | Experimental Data | Benchmark model performance | Adamson, Norman, Replogle datasets; PEX splitting recommended |
| Train Mean Baseline [80] [81] | Baseline Model | Simple benchmark for average perturbation effects | Average of training pseudo-bulk profiles |
| Matching Mean Baseline [80] | Baseline Model | Benchmark for combinatorial perturbations | Average of matching single-gene perturbation centroids |
| AUCell [80] | Analysis Tool | Scores pathway activity in single cells | Identifies systematically enriched pathways |
| GSEA [80] | Analysis Method | Gene set enrichment analysis | Detects pathway-level systematic variation |
| scGPT [80] [81] | Foundation Model | Benchmark complex model architecture | Fine-tune on perturbation data |
| GEARS [80] | Prediction Method | Benchmark model using biological networks | Incorporates prior knowledge |
| Random Forest with GO features [81] | Baseline Model | Biologically-informed baseline | Uses Gene Ontology vectors as features |

Interpretation Guidelines and Reporting Standards

Metric Interpretation Framework

The relationship between error distributions and metric appropriateness can be visualized as follows:

Analyze the error distribution first. A normal (bell-shaped, symmetric) distribution points to RMSE (optimal for normal errors); a Laplace (heavy-tailed) distribution points to MAE (optimal for Laplacian errors); an unknown or complex distribution points to Spearman's ρ (distribution-free). Whenever systematic variation is detected, regardless of distribution, supplement these with Systema metrics that target perturbation-specific effects.

Diagram Title: Metric Selection Based on Error Distribution

Comprehensive Reporting Standards

When reporting perturbation prediction results, include these essential elements:

  • Baseline Comparison: Always report performance of simple baselines (Train Mean, Matching Mean) alongside model results [80] [81]

  • Multiple Metric Perspective: Report both RMSE and rank correlation metrics to provide complementary views of performance

  • Systematic Variation Assessment: Document the extent of systematic variation in datasets and its potential impact on metrics [80]

  • Statistical Significance: Include confidence intervals or p-values for correlation metrics to distinguish meaningful differences from random variation

  • Biological Validation: Where possible, supplement quantitative metrics with biological validation of predicted perturbation effects

The evaluation of perturbation prediction models requires careful metric selection informed by both statistical principles and biological considerations. The recent discovery that simple baselines can outperform sophisticated foundation models underscores the limitations of current evaluation approaches and the pervasive influence of systematic variation in perturbation datasets [80] [81].

The introduction of bias-aware evaluation frameworks like Systema represents significant progress toward more biologically meaningful assessment [80]. Future developments should focus on creating metrics that better capture a model's ability to predict functionally relevant perturbation effects rather than merely recapitulating systematic biases. As the field advances, standardized evaluation protocols incorporating these insights will be essential for meaningful comparison of perturbation prediction methods and translation of computational predictions to biological discovery and therapeutic development.

A principal ambition in the development of single-cell Foundation Models (scFMs) is their capacity for zero-shot prediction—the ability to accurately forecast the effects of genetic or chemical perturbations without task-specific fine-tuning. This capability is considered a critical benchmark for true biological understanding within these models. The rationale is that a model which has internalized the fundamental rules of cell biology from its pretraining data should be able to generalize its knowledge to novel experimental conditions, including unseen perturbations. Such a capability would revolutionize drug discovery and functional genomics by enabling in silico screening of perturbation outcomes, drastically reducing experimental costs and time. However, recent rigorous benchmarking studies have revealed a significant gap between this ambition and current model capabilities, showing that scFMs often fail to outperform deliberately simple baselines on perturbation prediction tasks [11].

The core challenge lies in the models' ability to move beyond pattern recognition in their training data to genuine mechanistic reasoning about novel perturbations. This application note synthesizes current evidence on the zero-shot generalization capacities of leading scFMs, providing structured experimental protocols and benchmarks to guide their evaluation in perturbation modeling tasks. By establishing standardized assessment frameworks, we aim to facilitate more meaningful comparisons across models and accelerate progress toward truly generalizable perturbation prediction systems.

Current Landscape of Single-Cell Foundation Models

The field of single-cell Foundation Models has rapidly diversified, with multiple architectures employing different pretraining strategies and learning objectives. Understanding these foundational differences is crucial for interpreting their varied performance on zero-shot perturbation tasks. These models are predominantly built on transformer architectures and learn from vast single-cell RNA sequencing corpora, but they diverge significantly in their approach to representing biological information [1] [12].

Table 1: Key Single-Cell Foundation Models and Their Architectures

| Model | Architecture Type | Parameters | Pretraining Data Scale | Key Innovation |
| --- | --- | --- | --- | --- |
| scGPT | Decoder-only Transformer | 50 million | 33 million cells | Generative pretraining with gene expression prediction [4] |
| Geneformer | Encoder-only Transformer | 40 million | 30 million cells | Rank-based gene tokenization; mechanistic network learning [1] |
| scFoundation | Asymmetric encoder-decoder | 100 million | 50 million cells | Read-depth-aware masked gene modeling [10] |
| scBERT | Bidirectional Encoder | Not specified | Millions of cells | Early transformer adaptation for single-cell data [1] |
| UCE | Encoder with protein embeddings | 650 million | 36 million cells | Incorporates protein sequence information via ESM-2 embeddings [10] |
| LPM (Large Perturbation Model) | Decoder-only with disentangled conditioning | Not specified | Heterogeneous perturbation data | Explicit disentanglement of Perturbation, Readout, and Context (PRC) [4] |

A critical differentiator among these models is their tokenization strategy—how they convert gene expression data into sequences that transformers can process. Most models represent individual genes as tokens, but they employ different methods for handling expression values and gene ordering. Some models like Geneformer rank genes by expression level to create input sequences, while others like scGPT use value binning or projection techniques [1] [12] [10]. The recently proposed Large Perturbation Model (LPM) introduces a novel approach by explicitly disentangling the representation of perturbations, readouts, and experimental contexts, allowing it to integrate more heterogeneous perturbation data across different modalities [4].

Benchmarking Zero-Shot Performance on Perturbation Tasks

Performance Relative to Simple Baselines

Recent comprehensive benchmarking studies have yielded sobering results regarding the zero-shot perturbation prediction capabilities of current scFMs. A landmark study published in Nature Methods in 2025 compared five foundation models and two other deep learning models against deliberately simple baselines for predicting transcriptome changes after single or double genetic perturbations [11]. The benchmarks assessed models on their ability to predict expression changes following double perturbations in K562 cells using data from Norman et al., with models trained on single perturbations and a subset of double perturbations then evaluated on held-out double perturbations.

The results revealed that no deep learning model outperformed a simple additive baseline that predicts the sum of individual logarithmic fold changes without using any double perturbation data during training. All scFMs had substantially higher prediction error (L2 distance between predicted and observed expression values) than this simplistic approach. Similarly, for predicting genetic interactions—where the phenotype of simultaneous perturbations differs unexpectedly from additive effects—none of the models performed better than a "no change" baseline that always predicts control condition expression [11].

Table 2: Benchmarking Results on Double Perturbation Prediction Tasks

| Model Category | Representative Models | Performance on Double Perturbation | Performance on Genetic Interaction Prediction | Key Limitations |
| --- | --- | --- | --- | --- |
| Simple Baselines | Additive model, No-change model | Reference standard | No-change model competitive with complex models | Biologically simplistic but surprisingly effective |
| Specialized Perturbation Models | GEARS, CPA | Worse than additive baseline | Not better than no-change baseline | Limited generalization beyond training data |
| General scFMs | scGPT, Geneformer, scFoundation, scBERT | Worse than additive baseline | Not better than no-change baseline; rarely predict synergistic interactions correctly | Struggle to capture nonlinear interaction effects |
| New Architectures | LPM | State-of-the-art on some tasks but limited zero-shot tests | Shows promise but requires rigorous benchmarking | Limited evaluation on true zero-shot scenarios |

These findings suggest that the goal of creating foundation models that provide generalizable representations of cellular states capable of predicting the outcome of not-yet-performed experiments "remains elusive" with current approaches [11]. The models particularly struggled to predict synergistic interactions, with most predominantly predicting buffering interactions and rarely making correct predictions of true synergistic effects.

Performance on Unseen Perturbation Prediction

The ability to predict effects of completely unseen perturbations represents an even greater challenge and more rigorous test of zero-shot capabilities. In benchmarks assessing prediction of single gene perturbation effects across different cell lines (K562 and RPE1), no deep learning model consistently outperformed even simpler baselines, including a linear model with carefully constructed embeddings or simply predicting the mean expression across training perturbations [11].

Notably, when researchers extracted gene embeddings from scFoundation and scGPT and used them in a simple linear model, this approach performed as well as or better than the native implementations of scGPT and GEARS with their built-in decoders. However, these embedding-enhanced linear models still did not consistently outperform linear models using embeddings derived directly from the perturbation data itself [11].

The most effective approach identified was a linear model with perturbation representations pretrained on relevant perturbation data, suggesting that pretraining on perturbation data specifically may be more valuable than pretraining on general single-cell atlases alone. This finding questions whether the current paradigm of pretraining on broad single-cell corpora is optimal for perturbation prediction tasks.
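
The "linear model with perturbation embeddings" idea amounts to ordinary least squares from embedding space to expression-change space; a self-contained sketch on synthetic data (all names, dimensions, and values are illustrative, not from any benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, emb_dim, n_genes = 40, 10, 8, 50

# Synthetic perturbation embeddings (rows) and a hidden linear response map
emb = rng.normal(size=(n_train + n_test, emb_dim))
true_map = rng.normal(size=(emb_dim, n_genes))
delta = emb @ true_map + 0.01 * rng.normal(size=(n_train + n_test, n_genes))

# Fit W minimizing ||emb_train @ W - delta_train||^2 (ordinary least squares)
W, *_ = np.linalg.lstsq(emb[:n_train], delta[:n_train], rcond=None)

# Predict expression changes for unseen perturbations from their embeddings
pred = emb[n_train:] @ W
mse = float(np.mean((pred - delta[n_train:]) ** 2))
print(mse)  # small, since the synthetic responses are (near-)linear in the embedding
```

In practice the embedding rows would come from scFoundation/scGPT gene representations or be learned directly from perturbation data; the point of the baseline is that everything downstream of the embedding is linear.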

Experimental Protocols for Assessing Zero-Shot Capabilities

Protocol 1: Double Perturbation Generalization Benchmark

Objective: Evaluate model capability to predict transcriptomic effects of unseen double gene perturbations after training on single perturbations and a subset of double perturbations.

Datasets:

  • Primary dataset: Norman et al. CRISPR activation data from K562 cells (100 individual genes and 124 pairs of genes) [11]
  • Expression data: Logarithm-transformed RNA sequencing values for 19,264 genes

Experimental Design:

  • Data Partitioning: Fine-tune models on all 100 single perturbations and 62 randomly selected double perturbations (approximately 50% of available pairs)
  • Testing: Assess prediction error on the remaining 62 held-out double perturbations
  • Robustness Measures: Repeat analysis five times using different random partitions with appropriate statistical testing

Evaluation Metrics:

  • Primary metric: L2 distance between predicted and observed expression values for the 1,000 most highly expressed genes
  • Secondary metrics: Pearson correlation delta measure, L2 distances for various gene subsets (most highly expressed or most differentially expressed genes)
  • Genetic interaction prediction: True-positive rate and false discovery proportion across threshold variations

Baseline Comparisons:

  • "No change" baseline: Always predicts control condition expression
  • "Additive" baseline: Predicts sum of individual logarithmic fold changes without using double perturbation data

Implementation Considerations:

  • Ensure consistent preprocessing and normalization across all models
  • Use standardized train/validation/test splits to enable cross-study comparisons
  • Report computational requirements and training time for each model

Protocol 2: Cross-Cell Line Unseen Perturbation Transfer

Objective: Assess model capability to generalize perturbation effects across cellular contexts without fine-tuning.

Datasets:

  • Replogle et al. CRISPRi datasets (K562 and RPE1 cells) [11]
  • Adamson et al. CRISPR perturbation data (K562 cells) [11]

Experimental Design:

  • Training: Train models on perturbation data from one cell line (e.g., K562)
  • Testing: Evaluate prediction accuracy on unseen perturbations in a different cell line (e.g., RPE1)
  • Ablation: Compare performance when using different proportions of target cell line data for adaptation

Evaluation Metrics:

  • Mean squared error between predicted and observed expression changes
  • Correlation between predicted and actual differential expression patterns
  • Precision in identifying top differentially expressed genes

Baseline Comparisons:

  • Simple mean prediction across training perturbations
  • Linear models with various embedding strategies
  • Domain adaptation approaches specifically designed for cross-cell-line transfer

Protocol 3: Novel Compound Mechanism Identification

Objective: Evaluate model capability to predict mechanisms of action for novel chemical compounds based on structural or functional similarity to training compounds.

Datasets:

  • Library of Integrated Network-Based Cellular Signatures (LINCS) data combining genetic and pharmacological perturbations across multiple cellular contexts [4]

Experimental Design:

  • Leave-Out Classes: Remove all compounds sharing specific mechanisms of action or targeting specific pathways during training
  • Testing: Assess ability to correctly cluster novel compounds with known compounds sharing mechanisms despite structural differences
  • Evaluation: Quantify distance in embedding space between novel compounds and their known targets or mechanisms

Evaluation Metrics:

  • Cluster purity and separation in perturbation embedding space
  • Recovery of known drug-target interactions
  • Accuracy in predicting primary mechanisms of action for held-out compounds
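
One simple instantiation of the clustering evaluation is nearest-neighbor mechanism recovery in embedding space (a sketch; the embeddings, mechanism labels, and helper name are all illustrative, not from LINCS or any model):

```python
import numpy as np

def nn_mechanism_accuracy(embeddings: np.ndarray, labels: list) -> float:
    """Fraction of compounds whose nearest neighbor (cosine similarity,
    excluding self) shares the same mechanism-of-action label."""
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -np.inf)      # exclude self-matches
    nearest = sim.argmax(axis=1)
    hits = [labels[i] == labels[j] for i, j in enumerate(nearest)]
    return float(np.mean(hits))

# Toy embeddings: two well-separated mechanism clusters (illustrative only)
emb = np.array([
    [1.0, 0.1], [0.9, 0.0], [1.1, 0.2],   # "MEK inhibitor"-like cluster
    [0.0, 1.0], [0.1, 0.9],               # "HDAC inhibitor"-like cluster
])
moa = ["MEK", "MEK", "MEK", "HDAC", "HDAC"]

print(nn_mechanism_accuracy(emb, moa))  # 1.0 for these separable clusters
```

For held-out compound classes, the same score reports whether novel compounds land near known compounds that share their mechanism despite structural differences.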

Visualization of Benchmarking Workflows and Model Architectures

Single-cell atlases and perturbation datasets feed model pretraining, which shapes the model architecture; the pretrained model is then fine-tuned and submitted, together with simple baselines, to zero-shot benchmarks covering double perturbation, unseen perturbation, and cross-context transfer tasks, yielding the final results.

Zero-Shot Evaluation Workflow for scFMs

Encoder-based models (Geneformer, scBERT) contribute strong cell/gene embeddings (positive impact) but limited generation (negative impact); decoder-based models (scGPT, LPM) contribute expression prediction (positive) with weaker context understanding (negative); hybrid architectures (scFoundation) contribute multi-task capability (positive) at higher computational cost (negative). All of these factors converge on zero-shot performance on unseen perturbations.

Architectural Comparison for Perturbation Prediction

Table 3: Key Computational Tools and Frameworks for scFM Perturbation Research

| Tool/Resource | Type | Primary Function | Application in Perturbation Studies |
| --- | --- | --- | --- |
| BioLLM | Standardized framework | Unified interface for diverse scFMs | Enables consistent benchmarking across models with standardized APIs [86] [5] |
| GEARS | Specialized perturbation model | Predicts effects of single and double gene perturbations | Baseline for genetic perturbation prediction tasks [11] |
| CPA | Compositional perturbation autoencoder | Predicts effects of perturbation combinations and dosages | Handles drug combinations and dose-response relationships [11] |
| LPM | Large perturbation model | Integrates heterogeneous perturbation experiments | Cross-modal perturbation prediction (genetic + chemical) [4] |
| CELLxGENE | Data repository | Curated single-cell datasets | Source of standardized training and benchmarking data [1] [12] |
| Norman et al. Dataset | Benchmark dataset | CRISPR activation perturbation data | Gold standard for double perturbation benchmarking [11] |
| Replogle et al. Dataset | Benchmark dataset | CRISPRi perturbation data across cell lines | Evaluation of cross-cell-line generalization [11] |

Interpretation Guidelines and Limitations

When interpreting zero-shot perturbation prediction results, several critical considerations emerge from recent benchmarking studies:

  • Performance relative to simple baselines: The fact that simple additive models or mean predictors remain competitive with sophisticated scFMs suggests that current models may not be capturing higher-order biological interactions as effectively as hoped [11]. This performance gap should be openly acknowledged when presenting results.

  • Task-specific strengths: No single model consistently outperforms others across all perturbation tasks. For example, while scGPT demonstrates robust performance across multiple tasks in some benchmarks, other models like Geneformer and scFoundation show specialized strengths in gene-level tasks [86] [5]. Model selection should be guided by specific application requirements.

  • Data leakage concerns: Given that many scFMs are pretrained on massive single-cell corpora that may include perturbation data, rigorous protocols are needed to ensure that "unseen" perturbations in benchmarks are truly novel and not present in any form during pretraining [10].

  • Biological plausibility vs. quantitative accuracy: While quantitative metrics like L2 distance are important, biological plausibility of predictions should also be assessed through gene set enrichment analysis, pathway activation scores, and expert biological validation.

Current evidence suggests that the field must temper expectations about the zero-shot capabilities of existing scFMs while continuing to develop more sophisticated benchmarks and model architectures. The promising performance of newer approaches like LPM that explicitly model the disentanglement of perturbations, readouts, and contexts points toward potentially fruitful architectural directions [4].

The assessment of zero-shot prediction capabilities for unseen perturbations reveals both significant challenges and promising pathways forward. Current scFMs show limited ability to generalize beyond their training data to novel perturbations, particularly for complex genetic interactions and cross-context transfer. However, standardized benchmarking frameworks like BioLLM are enabling more rigorous comparisons, while novel architectures like LPM suggest potential strategies for improvement [86] [4] [5].

Critical future directions include: (1) developing more sophisticated benchmarking protocols that better reflect real-world biological discovery scenarios, (2) creating models that explicitly represent biological mechanisms rather than relying solely on statistical patterns in training data, (3) improving the integration of diverse data types including protein structures, pathway information, and chemical properties, and (4) establishing clearer evaluation metrics that balance quantitative accuracy with biological plausibility.

As the field progresses, the community would benefit from increased focus on model interpretability, better documentation of pretraining data composition, and more rigorous separation between training and evaluation data to enable true assessment of generalization capabilities. Through these efforts, the vision of accurate in silico prediction of perturbation effects may gradually transition from aspirational to achievable.

In the field of single-cell genomics, single-cell Foundation Models (scFMs) have emerged as transformative tools for interpreting the complex language of gene expression. Models like Geneformer, scGPT, scBERT, and UCE are pretrained on millions of single-cell transcriptomes, promising to capture universal biological principles and accelerate discovery in areas like drug development and disease modeling [10]. A core application driving this promise is in silico perturbation modeling—the ability to computationally predict how cells respond to genetic or chemical interventions. However, as these models are increasingly considered for high-stakes research, a critical and comparative evaluation of their capabilities, limitations, and optimal use cases is essential. This application note synthesizes recent benchmarking studies and practical protocols to provide a structured framework for researchers, scientists, and drug development professionals to effectively leverage these leading scFMs within their perturbation modeling workflows.

Model Architectures and Pretraining Paradigms

The comparative strength of an scFM is fundamentally shaped by its architectural choices and the data on which it was trained. The table below summarizes the core design principles of the four leading models.

Table 1: Architectural and Pretraining Overview of scFMs

| Model | Architecture | Pretraining Data Scale | Input Gene Representation | Primary Pretraining Task |
| --- | --- | --- | --- | --- |
| Geneformer [10] [87] | Transformer Encoder | 30 million cells | 2,048 ranked genes (no expression values) | Masked Gene Modeling (MGM) with gene ID prediction |
| scGPT [10] [88] | Transformer Encoder | 33 million cells | ~1,200 HVGs with binned expression values | Iterative MGM with MSE loss; generative pretraining |
| scBERT [10] [89] | Transformer Encoder | Not specified in context | Not specified in context | MGM with gene ID prediction |
| UCE [11] [10] | Transformer Encoder | 36 million cells | 1,024 genes sampled by expression & genomic position | Binary prediction of whether a gene is expressed |

Key distinctions include how they handle gene expression values: scGPT incorporates expression magnitudes through binning, whereas Geneformer uses a rank-based approach, discarding absolute expression to focus on the relative order of genes. UCE uniquely integrates protein sequence information by initializing its token embeddings using ESM-2 protein language model embeddings, providing a direct link to proteomic data [10].
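The two input-encoding strategies can be sketched in a few lines. This is a simplified illustration, not the models' actual tokenizers: `rank_tokenize` mimics Geneformer's rank-based ordering (expression magnitudes are discarded), while `bin_tokenize` mimics scGPT-style value binning; the function names, bin count, and sequence length are illustrative assumptions.

```python
import numpy as np

def rank_tokenize(expr, gene_ids, max_len=2048):
    """Geneformer-style encoding: order genes by descending expression
    and keep only the gene IDs; absolute magnitudes are discarded."""
    order = np.argsort(-expr, kind="stable")
    nonzero = order[expr[order] > 0]
    return [gene_ids[i] for i in nonzero[:max_len]]

def bin_tokenize(expr, n_bins=51):
    """scGPT-style encoding: map each nonzero expression value to a
    discrete quantile bin, preserving a coarse notion of magnitude."""
    binned = np.zeros_like(expr, dtype=int)
    nz = expr > 0
    if nz.any():
        edges = np.quantile(expr[nz], np.linspace(0, 1, n_bins))
        binned[nz] = np.digitize(expr[nz], edges[1:-1]) + 1
    return binned

expr = np.array([0.0, 5.2, 1.1, 0.0, 3.3])
genes = ["G0", "G1", "G2", "G3", "G4"]
ranked = rank_tokenize(expr, genes)     # gene IDs in descending expression order
bins = bin_tokenize(expr, n_bins=4)     # per-gene bin indices (0 = not expressed)
```

Note how the rank encoding keeps only relative order, while the binned encoding keeps a quantized magnitude per gene — the key design distinction discussed above.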

Performance Benchmarking in Perturbation Modeling

A critical application of scFMs is predicting the transcriptomic changes following genetic or chemical perturbations. Recent rigorous benchmarks, however, reveal a significant performance gap between promise and practice.

Benchmarking Against Simple Baselines

A landmark study published in Nature Methods directly compared five foundation models, including scGPT and Geneformer, against deliberately simple linear baselines for predicting transcriptome changes after single or double genetic perturbations [11]. The simple baselines were:

  • No-change model: Always predicts the control condition's expression.
  • Additive model: For a double perturbation, predicts the sum of the individual logarithmic fold changes.

Strikingly, none of the deep learning models outperformed the simple additive baseline in predicting double perturbation effects [11]. Furthermore, the models showed a poor ability to predict genetic interactions (e.g., synergistic or buffering effects), with none performing better than the no-change baseline [11].
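For concreteness, both baselines take only a few lines to implement. This sketch operates on mean expression vectors in log1p space with hypothetical helper names; it shows why these comparators are so demanding despite their simplicity.

```python
import numpy as np

def no_change(ctrl):
    """No-change baseline: always predict the control profile."""
    return ctrl

def additive_baseline(ctrl, single_a, single_b):
    """Additive baseline for a double perturbation A+B: sum the two
    single-perturbation log fold changes on top of the control."""
    lfc_a = np.log1p(single_a) - np.log1p(ctrl)
    lfc_b = np.log1p(single_b) - np.log1p(ctrl)
    return np.expm1(np.log1p(ctrl) + lfc_a + lfc_b)

ctrl = np.array([10.0, 4.0, 1.0])     # mean control expression
pert_a = np.array([20.0, 4.0, 1.0])   # perturbation A mostly affects gene 0
pert_b = np.array([10.0, 9.0, 1.0])   # perturbation B mostly affects gene 1
pred_ab = additive_baseline(ctrl, pert_a, pert_b)
```

Because most double perturbations combine near-independently, this additive prediction is often close to the truth, which is exactly what makes it hard to beat.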

Generalization to Unseen Perturbations

The ability to predict the effect of a completely novel perturbation is a key claim for foundation models. In this task, a simple mean predictor (which always predicts the average expression profile across the training set) and a linear model using Gene Ontology (GO) features consistently matched or outperformed sophisticated fine-tuned foundation models like scGPT and scFoundation across multiple Perturb-seq datasets [81]. In some cases, using the gene embeddings from scGPT as features for a simple Random Forest regressor yielded better performance than scGPT's own full fine-tuned pipeline, suggesting that the pretrained embeddings contain valuable biological information that the models' complex decoders struggle to leverage effectively for this task [11] [81].
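A hybrid of this kind is easy to prototype. The sketch below stands in random vectors for the pretrained scFM gene embeddings (in practice these would be extracted from, e.g., scGPT's gene-token embedding matrix) and fits a Random Forest to map each perturbed gene's embedding to its observed expression shift; all shapes and the toy linear ground truth are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_pert, emb_dim, n_readout = 40, 8, 5

# Stand-in for pretrained scFM gene embeddings: one vector per
# perturbed gene (in practice, taken from the model's gene tokens).
gene_emb = rng.normal(size=(n_pert, emb_dim))
# Observed pseudobulk expression shifts for each training perturbation
# (a toy linear ground truth so the regressor has signal to learn).
deltas = gene_emb @ rng.normal(size=(emb_dim, n_readout))

train, held_out = np.arange(30), np.arange(30, 40)
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(gene_emb[train], deltas[train])
pred = rf.predict(gene_emb[held_out])  # predicted shifts for unseen perturbations
```

The appeal of this setup is robustness: the regressor exploits whatever biology the embeddings encode without relying on the foundation model's own decoder.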

Table 2: Summary of Benchmarking Results in Perturbation Prediction

| Task | Top Performing Model(s) | Underperforming Model(s) | Key Metric |
| --- | --- | --- | --- |
| Double Perturbation Effect Prediction | Additive Baseline Model [11] | scGPT, Geneformer, UCE, scBERT, GEARS, CPA [11] | L2 distance on top genes |
| Genetic Interaction Prediction | No-change Baseline Model [11] | scGPT, Geneformer, UCE, scBERT, GEARS, CPA [11] | True-Positive Rate vs. False Discovery Proportion |
| Unseen Single Perturbation Prediction | Random Forest with GO features; Mean Predictor [81] | scGPT, scFoundation (fine-tuned) [81] | Pearson Delta (Δ) Correlation |
| Zero-shot Cell Type Clustering | HVG Selection, scVI, Harmony [90] | scGPT, Geneformer (zero-shot) [90] | Average BIO Score |
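The two quantitative metrics in the table are straightforward to compute. This is a minimal sketch with hypothetical helper names, where `ctrl`, `true`, and `pred` are mean (pseudobulk) expression profiles:

```python
import numpy as np

def pearson_delta(pred, true, ctrl):
    """Pearson correlation of predicted vs. observed expression
    *changes* relative to control (the 'Pearson Delta' metric)."""
    dp, dt = pred - ctrl, true - ctrl
    dp, dt = dp - dp.mean(), dt - dt.mean()
    return float(dp @ dt / (np.linalg.norm(dp) * np.linalg.norm(dt)))

def l2_top_genes(pred, true, ctrl, k=2):
    """L2 distance restricted to the k genes most changed by the
    perturbation (largest |true - ctrl|)."""
    top = np.argsort(-np.abs(true - ctrl))[:k]
    return float(np.linalg.norm(pred[top] - true[top]))

ctrl = np.array([1.0, 2.0, 3.0, 4.0])
true = np.array([3.0, 2.0, 3.0, 1.0])
pred = np.array([2.5, 2.0, 3.0, 1.5])
r = pearson_delta(pred, true, ctrl)
d = l2_top_genes(pred, true, ctrl, k=2)
```

Correlating deltas rather than raw profiles matters: raw expression is dominated by housekeeping genes, so a no-change prediction can score a deceptively high plain Pearson correlation.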

Practical Application Protocols

Despite benchmarking challenges, these models are powerful tools when applied correctly. The following protocols outline standard workflows for two common scenarios.

Protocol 1: Fine-tuning scGPT for Cell Type Annotation

This protocol is adapted from an end-to-end workflow for achieving high-accuracy retinal cell type annotation [91]. The following diagram illustrates the overall workflow.

[Workflow diagram] Title: scGPT Fine-tuning and Inference Workflow. The pipeline runs through three modules: a Preprocessing Module (input single-cell dataset → data preprocessing → preprocessed data file), a Fine-tuning Module (a fine-tuning setup combines the preprocessed data with the pretrained scGPT model to produce a fine-tuned model), and an Inference Module (inference and evaluation with the fine-tuned model generates the outputs: a UMAP plot, a CSV of predictions, and, optionally, a confusion matrix).

Key Steps:

  • Data Preprocessing: Clean, normalize, and bin the raw gene expression count matrix. The output is a preprocessed H5AD file, standardized for the subsequent steps [91].
  • Fine-tuning Setup: Load the preprocessed data and an scGPT model pretrained on 33 million cells. The model is then further trained (fine-tuned) on the annotated single-cell data for a limited number of epochs (e.g., 5-10) to specialize it for the specific cell type classification task [92] [91].
  • Inference & Evaluation: Use the fine-tuned model to predict cell types on the entire dataset or a held-out test set. Key outputs include a UMAP visualization for cluster inspection and a CSV file with prediction results. If ground truth labels are available, a confusion matrix is generated for accuracy assessment [91]. This protocol has been reported to achieve an F1-score of 99.5% on a retinal cell dataset [91].
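The preprocessing step can be sketched without the scGPT tooling itself. The parameters here (counts-per-10k normalization, log1p, per-cell quantile binning) follow common scGPT conventions, but the `preprocess` function is an illustrative stand-in, not the official preprocessor:

```python
import numpy as np

def preprocess(counts, target_sum=1e4, n_bins=51):
    """Sketch of Protocol 1 preprocessing: library-size normalization
    to counts-per-10k, log1p transform, then per-cell value binning
    as expected by scGPT-style models."""
    libsize = counts.sum(axis=1, keepdims=True)
    norm = counts / np.clip(libsize, 1, None) * target_sum
    logged = np.log1p(norm)
    binned = np.zeros_like(logged, dtype=int)
    for i, row in enumerate(logged):
        nz = row > 0
        if nz.any():
            edges = np.quantile(row[nz], np.linspace(0, 1, n_bins))
            binned[i, nz] = np.digitize(row[nz], edges[1:-1]) + 1
    return logged, binned

# Toy count matrix: 2 cells x 3 genes.
counts = np.array([[0, 5, 3], [2, 0, 8]], dtype=float)
logged, binned = preprocess(counts, n_bins=3)
```

In the real workflow, `logged` and `binned` would be written back into an H5AD file for the fine-tuning module; zeros remain zero so unexpressed genes keep a dedicated bin.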

Protocol 2: Fine-tuning Geneformer for Metadata Prediction

This protocol outlines the process for adapting Geneformer to predict donor metadata, such as age group, from single-cell transcriptomes [87].

Key Steps:

  • Data Tokenization: Convert the gene expression profile of each cell into a sequence of gene tokens. Geneformer's canonical approach involves ranking genes by expression and using the top 2,000 genes, truncated to a maximum length (e.g., 512 tokens per cell) [87].
  • Model Configuration: Fine-tune Geneformer as a classifier, typically by unfreezing only the final few transformer layers. This strategy leverages the general knowledge from pretraining while efficiently adapting the model to the new task, mitigating overfitting [87].
  • Training and Evaluation: Train the model on the tokenized sequences with the age group as the label. Performance is evaluated on a held-out test set of cells, using metrics like the F1-score to account for class imbalance. In a benchmark predicting donor age from NK cells, a fine-tuned Geneformer model achieved an F1-score of 0.63, outperforming a Random Forest classifier (F1-score of 0.47) [87].
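The tokenization logic from step 1 can be sketched as follows; the vocabulary, gene names, and reserved token IDs are illustrative placeholders rather than Geneformer's actual dictionary:

```python
import numpy as np

# Hypothetical token vocabulary: one integer ID per gene symbol,
# with IDs 0/1 reserved (e.g., for padding and masking).
vocab = {f"GENE{i}": i + 2 for i in range(3000)}

def geneformer_tokenize(expr, gene_names, top_k=2000, max_len=512):
    """Rank genes by descending expression, keep the top_k expressed
    genes, then truncate the token sequence to max_len."""
    order = np.argsort(-expr, kind="stable")
    order = order[expr[order] > 0][:top_k]
    tokens = [vocab[gene_names[i]] for i in order]
    return tokens[:max_len]

rng = np.random.default_rng(1)
expr = rng.poisson(1.0, size=3000).astype(float)  # toy count profile
names = list(vocab)
tokens = geneformer_tokenize(expr, names)
```

Each cell thus becomes a fixed-length sequence of gene IDs ordered by expression, which the classifier head consumes like a sentence of tokens.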

The Scientist's Toolkit: Essential Research Reagents

Successful application of scFMs relies on a foundation of data, software, and computational resources.

Table 3: Essential Research Reagents and Resources

| Item Name | Function / Application | Example / Specification |
| --- | --- | --- |
| Annotated H5AD File | Standard data container for single-cell data; primary model input. | Contains .X (expression matrix), .obs (cell metadata), and .var (gene metadata) [88]. |
| Highly Variable Genes (HVGs) | Reduces dimensionality and computational cost; focuses the model on the most informative genes. | Typically the top 1,000-2,000 HVGs are used as input for models like scGPT [88]. |
| Pre-trained Model Weights | Provides the foundation of biological knowledge; starting point for fine-tuning. | Downloaded from official sources (e.g., Hugging Face for Geneformer, Google Drive for scGPT) [88] [87]. |
| GPU Computing Resource | Accelerates model fine-tuning and inference, reducing time from hours to minutes. | Tested on setups like an NVIDIA A100, T4, or consumer-grade hardware with sufficient VRAM (≥32 GB system RAM recommended) [88] [89]. |
| Gene Ontology (GO) Annotations | Provides prior biological knowledge; can be used as features in simple, high-performing baseline models [81]. | Used as input for Random Forest or linear models in benchmarking studies. |
| Perturbation Datasets | Gold-standard data for training and benchmarking in silico perturbation models. | Includes the Norman et al. (CRISPRa), Adamson et al., and Replogle et al. (CRISPRi) datasets [11] [81]. |

The current landscape of single-cell foundation models presents a paradox of immense potential tempered by rigorous benchmarking. Models like Geneformer and scGPT have demonstrated strong performance in specific tasks like cell type annotation and metadata prediction when properly fine-tuned [87] [91]. However, for the pivotal task of in silico perturbation prediction, they have not yet consistently surpassed simple, biologically-informed baselines [11] [81]. This suggests that while their pretrained embeddings capture valuable biological information, their complex architectures may not be optimally decoding this information for predictive causal tasks.

For researchers and drug development professionals, the following strategic recommendations are proposed:

  • Validate Against Baselines: Before deploying a complex scFM for perturbation prediction, always benchmark its performance against simple baselines, including a mean predictor and linear models using GO term features [11] [81].
  • Choose the Right Tool for the Task: Leverage scGPT and Geneformer for tasks they excel at, such as fine-tuning for cell classification or leveraging their embeddings for exploratory data analysis. For perturbation screens, consider hybrid approaches that use scFM embeddings as features for simpler, more robust models [11] [92].
  • Prioritize Fine-tuning for High-Accuracy Needs: For rapid dataset exploration, zero-shot use of scGPT may be sufficient. However, for publication-grade or clinical-grade annotations, investing the resources to fine-tune the model on a curated, task-specific dataset is critical and can yield significant accuracy improvements [92] [91].
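The first recommendation — always running the mean-predictor baseline — takes only a few lines. This toy harness uses synthetic profiles with a shared perturbation signature, which is precisely the regime where the mean predictor is strong; any model whose score on real data does not clearly beat `baseline_score` warrants skepticism.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 20
ctrl = rng.random(n_genes)
signature = rng.normal(0.0, 1.0, n_genes)  # response shared across perturbations
train = ctrl + signature + rng.normal(0, 0.1, (10, n_genes))
test = ctrl + signature + rng.normal(0, 0.1, (5, n_genes))

def mean_predictor(train_profiles):
    """Baseline: predict the average training profile for everything."""
    return train_profiles.mean(axis=0)

def mean_pearson_delta(pred, profiles, ctrl):
    """Average Pearson correlation of predicted vs. observed deltas."""
    scores = []
    for true in profiles:
        dp, dt = pred - ctrl, true - ctrl
        dp, dt = dp - dp.mean(), dt - dt.mean()
        scores.append(dp @ dt / (np.linalg.norm(dp) * np.linalg.norm(dt)))
    return float(np.mean(scores))

baseline_score = mean_pearson_delta(mean_predictor(train), test, ctrl)
```

When perturbation responses are highly correlated (as in many Perturb-seq datasets), this baseline alone achieves high Pearson Delta scores, which is why reporting it alongside any scFM result is essential.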

The ongoing development of scFMs is a rapidly evolving frontier. Future model generations, trained on even larger and more diverse datasets and potentially incorporating more causally-aware architectures, are poised to more fully deliver on the promise of accurate in silico perturbation modeling.

Single-cell foundation models (scFMs), such as scGPT and Geneformer, represent a paradigm shift in computational biology, trained on millions of cells from atlases like the Human Cell Atlas to learn universal representations of cellular states [18]. These models are increasingly employed for in silico perturbation (ISP) prediction, aiming to simulate cellular responses to genetic or chemical interventions without costly experiments [9] [4]. However, a fundamental challenge arises from the data fidelity gap—the discrepancy between the large-scale, observational "atlas" data used for pretraining and the specific, high-fidelity data generated in controlled perturbation experiments. This application note examines the technical basis of this gap, presents quantitative evidence of its impact on model performance, and provides detailed protocols to bridge it, thereby enhancing the predictive accuracy of scFMs in therapeutic discovery contexts.

Results

Quantitative Evidence of the Data Fidelity Gap

The performance disparity between models using only atlas data versus those incorporating targeted perturbation data is substantial and measurable. Systematic benchmarking reveals that while scFMs are powerful, their open-loop ISP predictions can suffer from low positive predictive value.

Table 1: Performance Comparison of Open-Loop vs. Closed-Loop ISP for T-cell Activation Prediction

| Model Type | Positive Predictive Value (PPV) | Negative Predictive Value (NPV) | Sensitivity | Specificity | AUROC |
| --- | --- | --- | --- | --- | --- |
| Open-Loop ISP (Geneformer) | 3% | 98% | 48% | 60% | 0.63 |
| Differential Expression (DE) | 3% | 78% | 40% | 50% | Not reported |
| DE & Open-Loop ISP Overlap | 7% | Not reported | Not reported | Not reported | Not reported |
| Closed-Loop ISP (with perturbation data) | 9% | 99% | 76% | 81% | 0.86 |

Data derived from evaluation of Geneformer-30M-12L fine-tuned on T-cell activation status [9].

As shown in Table 1, the integration of even a limited number of perturbation examples during fine-tuning can produce a three-fold increase in PPV while simultaneously improving sensitivity and specificity [9]. This demonstrates that the data fidelity gap is not merely theoretical but has concrete, quantifiable effects on a model's ability to identify true positive targets.
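The metrics in Table 1 follow directly from screen confusion counts. The counts below are invented to mimic a rare-hit genome-scale screen (they are not the study's raw data), but they illustrate why PPV stays in single digits even with decent sensitivity and specificity when true positives are only a small fraction of candidates:

```python
def screen_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics as reported in Table 1 (percent, rounded)."""
    return {
        "PPV": round(100 * tp / (tp + fp)),
        "NPV": round(100 * tn / (tn + fn)),
        "sensitivity": round(100 * tp / (tp + fn)),
        "specificity": round(100 * tn / (tn + fp)),
    }

# Invented counts for a screen where true positives are rare:
# even at 76% sensitivity and 80% specificity, false positives
# from the large negative class swamp the hit list.
m = screen_metrics(tp=76, fp=800, tn=3200, fn=24)
```

The asymmetry between PPV and NPV in Table 1 is thus largely a base-rate effect, which is why even the modest absolute PPV gain from closed-loop fine-tuning translates into a three-fold enrichment of true hits.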

The Underlying Causes of the Fidelity Gap

The data fidelity gap arises from several interconnected technical and biological factors:

  • Representational Mismatch: Atlas data primarily captures homeostatic or developmental cellular states, providing a snapshot of natural heterogeneity. It lacks systematic examples of causal transitions induced by external perturbations, making models trained solely on this data poorly equipped to simulate dynamic responses [18] [4].
  • Context Specificity: Perturbation effects are highly context-dependent, varying by cell type, microenvironment, and metabolic state. Foundation models pretrained on broad atlas data may average these specific effects, limiting their precision for focused experimental simulations [4].
  • Limited Perturbation Vocabulary: The "vocabulary" of perturbations represented in atlas data (e.g., natural mutations or disease-associated variations) is narrow compared to the vast space of possible therapeutic interventions (e.g., chemical inhibitors, CRISPRa/i) [4].

Experimental Protocols

Protocol 1: Implementing a Closed-Loop Fine-Tuning Framework for scFMs

This protocol describes an iterative fine-tuning process that incorporates experimental perturbation data to "close the loop" and enhance a model's predictive accuracy for a specific biological context, such as a disease model [9].

Workflow Overview:

[Workflow diagram] Pretrained scFM (e.g., Geneformer) → 1. Context Fine-Tuning → 2. Open-Loop ISP → 3. Target Prioritization → 4. Experimental Validation (Perturb-seq, Flow Cytometry) → 5. Closed-Loop Fine-Tuning (incorporating the validation data) → Validated Closed-Loop Model.

Step-by-Step Procedure:

  • Initial Context Fine-Tuning

    • Objective: Adapt a pretrained scFM (e.g., Geneformer or scGPT) to distinguish between relevant cellular states (e.g., diseased vs. healthy).
    • Input Data: Single-cell RNA sequencing (scRNA-seq) data from control and case conditions (e.g., RUNX1-knockout hematopoietic stem cells (HSCs) vs. control HSCs) [9].
    • Method: Fine-tune the model using the standard procedure for the chosen scFM. For Geneformer, this involves further training with a learning rate of 5e-5 and a classification head to accurately assign cells to their state of origin. Validate classification accuracy on a hold-out test set (target >95%).
  • Open-Loop In Silico Perturbation

    • Objective: Generate initial predictions of genes that, when perturbed, shift the disease state toward a healthy state.
    • Method: Using the fine-tuned model from Step 1, perform ISP for each gene in the genome. Simulate both knockout and overexpression scenarios. The model outputs a predicted shift in cell state for each perturbation.
  • Target Prioritization

    • Objective: Select a manageable set of high-confidence targets for experimental validation.
    • Method: Combine ISP predictions with results from differential expression analysis. Prioritize genes predicted by both methods or those with the strongest effect sizes in the desired direction. Filter for genes with available chemical inhibitors or CRISPR tools for downstream testing.
  • Experimental Validation

    • Objective: Generate high-fidelity ground-truth data for the prioritized targets.
    • Methods:
      • For Genetic Perturbations: Use Perturb-seq (CRISPR-based screening coupled with scRNA-seq) in the relevant cell model (e.g., RUNX1-FPD HSCs). Target the top 20-50 genes identified in Step 3.
      • For Functional Validation: Use orthogonal assays like flow cytometry to measure specific functional outcomes (e.g., cytokine production in T-cells) [9].
    • Output: Labeled scRNA-seq or functional data confirming the effect of each perturbation on the cellular state.
  • Closed-Loop Fine-Tuning

    • Objective: Integrate the experimental validation data back into the model to improve its predictive rules.
    • Input Data: Pool the original training data from Step 1 with the new scRNA-seq data from Step 4. The perturbation data should be labeled with the resulting cellular state (e.g., "shifted toward control" or "not shifted").
    • Method: Continue fine-tuning the model from Step 1 on this combined dataset. Even a small number of perturbation examples (as few as 20) can significantly enhance performance [9].
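The data pooling in Step 5 is conceptually simple. This sketch uses made-up array shapes and labels to show how the ~20 validated perturbation profiles are appended to the original context-classification data before fine-tuning resumes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 100

# Step 1 data: case/control cells with their state labels
# (0 = disease state, 1 = healthy/control state).
context_X = rng.random((200, n_genes))
context_y = np.array([0] * 100 + [1] * 100)

# Step 4 data: profiles from ~20 validated perturbations, labeled by
# outcome (1 = shifted toward control, 0 = not shifted).
pert_X = rng.random((20, n_genes))
pert_y = np.array([1] * 12 + [0] * 8)

pool_X = np.concatenate([context_X, pert_X])
pool_y = np.concatenate([context_y, pert_y])
# pool_X / pool_y are then fed back into the Step 1 fine-tuning
# procedure (same learning rate and classification head).
```

The key point is that the perturbation examples, however few, expose the model to causal state transitions that the original observational data never contained.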

Protocol 2: Utilizing Large Perturbation Models (LPMs) for Multi-Task Learning

For researchers without immediate access to wet-lab capabilities, leveraging existing large-scale perturbation models and datasets provides an alternative strategy to mitigate the fidelity gap.

Workflow Overview:

[Workflow diagram] Heterogeneous Perturbation Data → Disentangled PRC (Perturbation-Readout-Context) Architecture → Large Perturbation Model (LPM) → Multi-Task Learning (effect prediction, MoA identification, network inference).

Step-by-Step Procedure:

  • Model Selection and Access

    • Objective: Identify a pre-built LPM trained on diverse perturbation experiments.
    • Resources: Models like the described LPM integrate data from sources such as the Library of Integrated Network-Based Cellular Signatures (LINCS), which contains thousands of genetic and chemical perturbation experiments across multiple contexts [4].
  • Querying the Model for Target Discovery

    • Objective: Use the LPM to simulate perturbations within a disease-relevant context.
    • Method: Leverage the model's disentangled architecture, which separately represents Perturbation (P), Readout (R), and Context (C).
      • Specify the biological context (e.g., a specific cell line from LINCS).
      • Query the effect of a specific perturbation (e.g., knockout of a gene of interest) on a desired readout (e.g., transcriptome).
    • Output: A predicted post-perturbation gene expression profile.
  • Mechanism of Action (MoA) Analysis

    • Objective: Identify shared mechanisms between genetic and chemical perturbations.
    • Method: Analyze the joint latent space of the LPM. The model clusters perturbations with similar mechanisms, allowing for hypothesis generation. For example, a compound of unknown mechanism positioned near a CRISPR intervention targeting mTOR suggests a shared pathway [4].
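The nearest-neighbor logic behind this MoA analysis can be sketched with cosine similarity; the embeddings and perturbation names below are entirely hypothetical stand-ins for the LPM's joint latent space:

```python
import numpy as np

def nearest_perturbations(query, latent, names, k=2):
    """Rank known perturbations by cosine similarity to a query
    embedding in the model's joint latent space."""
    q = query / np.linalg.norm(query)
    z = latent / np.linalg.norm(latent, axis=1, keepdims=True)
    sims = z @ q
    order = np.argsort(-sims)[:k]
    return [(names[i], float(sims[i])) for i in order]

# Hypothetical latent vectors for three known genetic perturbations.
names = ["mTOR_KO", "TP53_KO", "MYC_OE"]
latent = np.array([[1.0, 0.1, 0.0],
                   [0.0, 1.0, 0.2],
                   [0.1, 0.0, 1.0]])
# A compound of unknown mechanism that lands near the mTOR knockout,
# suggesting a shared pathway (as in the text's example).
compound = np.array([0.9, 0.2, 0.05])
hits = nearest_perturbations(compound, latent, names)
```

Ranked neighbors of this kind serve as hypotheses, not conclusions; a high-similarity genetic neighbor nominates a pathway for orthogonal experimental confirmation.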

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Computational and Experimental Reagents for Bridging the Fidelity Gap

| Category | Reagent / Tool | Function / Description | Key Application in Protocol |
| --- | --- | --- | --- |
| Foundation Models | scGPT [18] | A generative pretrained transformer model for single-cell multi-omics analysis. Pretrained on >33 million cells. | Protocol 1: Base model for context fine-tuning and open-loop ISP. |
| | Geneformer [9] [4] | A transformer model pretrained on a large corpus of transcriptomic data to learn a foundational representation of network dynamics. | Protocol 1: Used for closed-loop fine-tuning in T-cell and RUNX1-FPD case studies. |
| | Large Perturbation Model (LPM) [4] | A model integrating heterogeneous perturbation data by disentangling Perturbation, Readout, and Context (PRC) dimensions. | Protocol 2: Primary model for predicting outcomes of unobserved perturbations and MoA analysis. |
| Data Platforms | DISCO / CZ CELLxGENE [18] | Platforms aggregating over 100 million cells for federated analysis and data retrieval. | Protocol 1: Source of initial scRNA-seq data for context fine-tuning. |
| | LINCS Database [4] | A repository containing perturbation responses for thousands of genetic and chemical perturbagens across many cell lines. | Protocol 2: Key data source for training and querying LPMs. |
| Experimental Tools | Perturb-seq [9] | A high-throughput method combining CRISPR-based perturbations with single-cell RNA sequencing to read out molecular phenotypes. | Protocol 1: Critical for generating high-fidelity validation data for closed-loop learning. |
| | CRISPRi/a Screens [9] | CRISPR interference or activation screens to repress or activate target genes, often coupled with functional readouts (e.g., flow cytometry). | Protocol 1: Provides orthogonal, functional validation of ISP predictions. |

The data fidelity gap is a critical, yet addressable, challenge in the application of single-cell foundation models to perturbation modeling. Quantitative benchmarks demonstrate that moving from an open-loop to a closed-loop framework can drastically improve predictive accuracy. The protocols provided here—ranging from a comprehensive wet-lab-in-the-loop fine-tuning process to a computational-focused approach using existing LPMs—offer actionable roadmaps for researchers. By systematically integrating high-fidelity perturbation data, scFMs can evolve from powerful pattern recognition engines into reliable, predictive virtual cells capable of accelerating therapeutic discovery.

Conclusion

In silico perturbation modeling with single-cell foundation models represents a paradigm shift with immense potential for accelerating therapeutic discovery and understanding cellular mechanisms, particularly for rare diseases where patient samples are scarce. The development of 'closed-loop' frameworks demonstrates a promising path forward, showing that iterative incorporation of experimental data can significantly boost predictive accuracy. However, recent comprehensive benchmarks present a sobering counterpoint, revealing that current scFMs often fail to outperform deliberately simple baselines on perturbation prediction tasks. This underscores that the field is still in its nascent stages, with significant challenges remaining in model generalization, interpretability, and computational efficiency. The future success of this field will depend on developing more specialized architectures, curating higher-quality and more diverse perturbation datasets, establishing rigorous and standardized benchmarking practices, and fostering a tighter integration between computational prediction and experimental validation. Ultimately, overcoming these hurdles will be crucial for transforming the promise of 'virtual cells' into a reliable tool for biomedical discovery and clinical translation.

References