In silico perturbation modeling using single-cell foundation models (scFMs) promises to revolutionize biological discovery and therapeutic development by predicting cellular responses to genetic and chemical interventions. This article explores the foundational concepts of scFMs, their architectural principles, and their application in predicting perturbation effects. It critically examines current methodological approaches, including the emerging 'closed-loop' fine-tuning paradigm, which significantly enhances predictive accuracy by iteratively incorporating experimental data. Furthermore, the article addresses the pressing challenges and limitations highlighted by recent rigorous benchmarks, which show that current models often struggle to outperform simple linear baselines. Finally, it provides a comprehensive overview of the validation landscape, synthesizing insights from multiple benchmarking studies to guide researchers in evaluating model performance and to outline a path forward for realizing the full potential of virtual cell models in biomedical research.
Foundation models are a class of artificial intelligence systems trained on vast datasets with self-supervised learning objectives, which lets them develop general-purpose representations that can be adapted to diverse downstream tasks with little task-specific training [1]. In biology, these models are changing how researchers analyze complex systems by learning patterns from massive, unlabeled datasets, including genomic sequences, single-cell transcriptomes, and protein structures [2]. The core innovation lies in the self-supervised pretraining phase, in which a model learns to predict masked or contextually relevant elements of its input and thereby captures biological relationships without human-provided labels [1] [3].
The application of foundation models to biological data represents a paradigm shift from traditional supervised approaches, which require extensive labeled datasets that are often expensive and time-consuming to create [3]. Instead, biological foundation models leverage the enormous quantities of unlabeled data being generated by modern high-throughput technologies, from single-cell sequencing platforms to genomic databases [1] [2]. This approach has proven particularly powerful in biological domains where labeled data is scarce but unlabeled data is abundant, enabling models to learn the fundamental "language" of biology—whether that be the grammar of gene regulation, the syntax of protein folding, or the vocabulary of cellular states [1].
Table: Key Characteristics of Biological Foundation Models
| Characteristic | Description | Biological Examples |
|---|---|---|
| Self-Supervised Pretraining | Models learn by predicting masked portions of input data without human labeling | Predicting masked genes in single-cell data [1] |
| Transfer Learning | Pretrained models adapt to new tasks with minimal additional training | Geneformer fine-tuned for disease-specific predictions [2] |
| Scalability | Models trained on millions to billions of data points | scGPT pretrained on over 33 million cells [2] |
| Multi-task Capability | Single model handles diverse prediction tasks | LPM predicts perturbation effects and identifies mechanisms [4] |
Self-supervised learning (SSL) represents the foundational training paradigm that enables foundation models to learn meaningful representations from unlabeled biological data [3]. In biological contexts, SSL methods create training signals directly from the data itself by designing pretext tasks that require the model to learn intrinsic patterns and relationships [3]. For genomic sequences, this might involve predicting missing nucleotides or reverse-complement sequences; for single-cell data, this typically means predicting masked gene expressions based on the context of other genes within the same cell [1] [3].
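The masked-expression pretext task described above can be illustrated with a minimal sketch. It only builds one training example (inputs with hidden values, plus the targets a model would be asked to recover); the mask token name, the 15% default, and the helper `make_masked_example` are illustrative assumptions, not the procedure of any specific scFM.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from any real model


def make_masked_example(expression, mask_fraction=0.15, seed=0):
    """Hide a random subset of genes in one cell.

    expression: dict mapping gene symbol -> expression value for a single cell.
    Returns (model_input, targets): the input has masked genes replaced by MASK,
    and targets holds the hidden values the model would be trained to predict.
    """
    rng = random.Random(seed)
    genes = sorted(expression)
    n_mask = max(1, round(mask_fraction * len(genes)))
    masked = set(rng.sample(genes, n_mask))
    model_input = {g: (MASK if g in masked else expression[g]) for g in genes}
    targets = {g: expression[g] for g in masked}
    return model_input, targets


# Toy cell with made-up expression values.
cell = {"CD3E": 5.0, "CD4": 3.0, "IL2RA": 0.0, "GAPDH": 9.0, "ACTB": 8.0, "FOXP3": 1.0}
model_input, targets = make_masked_example(cell, mask_fraction=0.3)
```

The training signal comes entirely from the data: no cell-type or condition label is needed to construct `targets`.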
The transformer architecture has emerged as the dominant backbone for biological foundation models due to its ability to capture long-range dependencies and complex relationships within sequential data [1]. In single-cell biology, transformers process gene expression profiles by treating individual genes as "tokens" analogous to words in a sentence, allowing the model to learn how genes co-express and regulate one another across diverse cellular contexts [1]. Models like scBERT and Geneformer employ bidirectional attention mechanisms that consider all genes simultaneously, enabling comprehensive understanding of gene-gene interactions [1] [5]. Alternatively, decoder-based models like scGPT use autoregressive approaches that predict gene expressions sequentially, similar to how language models generate text [1].
Tokenization strategies form a critical component of biological foundation models, determining how raw biological data is transformed into model-processable units [1]. For single-cell data, this involves converting gene expression profiles into discrete tokens, typically by binning expression values or ranking genes by expression level within each cell [1]. A key challenge in this process is that gene expression data lacks natural sequential ordering—unlike words in a sentence—requiring researchers to impose artificial orderings based on expression magnitude or other criteria [1]. Advanced tokenization approaches may incorporate additional biological context, such as gene ontology terms or chromosomal locations, to enrich the input representations [1].
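As a rough illustration of rank-based tokenization, the sketch below orders the expressed genes of one cell from highest to lowest value and truncates to a maximum length. It is a simplification under stated assumptions: the function name `rank_tokenize`, the alphabetical tie-break, and the omission of any corpus-level normalization are choices made here for clarity, and real models may scale expression values before ranking.

```python
def rank_tokenize(expression, max_len=2048):
    """Turn one cell's expression profile into an ordered list of gene tokens.

    expression: dict mapping gene symbol -> expression value.
    Genes with zero expression are dropped; the rest are sorted from highest
    to lowest value, with ties broken alphabetically so the output is
    deterministic (an assumption of this sketch).
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    return [gene for gene, _ in expressed[:max_len]]


# Toy cell with made-up values.
cell = {"GAPDH": 9.0, "CD3E": 5.0, "CD4": 3.0, "IL2RA": 0.0, "FOXP3": 1.0}
tokens = rank_tokenize(cell)
```

Here the position of a gene in `tokens` carries the expression information, which is why an imposed ordering is needed at all.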
Single-cell foundation models (scFMs) have emerged as powerful tools for in silico perturbation modeling, enabling researchers to simulate cellular responses to genetic and chemical perturbations without conducting expensive wet-lab experiments [4] [6]. These models learn the fundamental principles of cellular organization from large-scale single-cell atlases, capturing how gene networks interact and respond to disturbances [1]. When applied to perturbation modeling, scFMs can predict transcriptomic changes resulting from gene knockouts, drug treatments, or other interventions, significantly accelerating biological discovery and drug development [4].
The Large Perturbation Model (LPM) represents a cutting-edge approach that specifically addresses the challenges of integrating heterogeneous perturbation data across different experimental contexts, readout modalities, and perturbation types [4]. LPM employs a disentangled architecture that separately represents perturbations (P), readouts (R), and contexts (C) as distinct dimensions, enabling the model to learn generalizable perturbation-response rules that transfer across biological settings [4]. This approach has demonstrated superior performance in predicting post-perturbation transcriptomes compared to existing methods, while also enabling the identification of shared molecular mechanisms between chemical and genetic perturbations [4].
Table: Performance Comparison of Perturbation Modeling Approaches
| Method | Architecture | Perturbation Types Supported | Prediction Accuracy (Pearson R) | Key Applications |
|---|---|---|---|---|
| LPM [4] | PRC-disentangled transformer | Genetic, chemical, multi-omics | 0.72-0.89 (across contexts) | Mechanism identification, drug-target mapping |
| Geneformer [4] [2] | Transformer encoder | Genetic | 0.61-0.75 | Network dynamics, disease modeling |
| scGPT [4] [5] | GPT-style decoder | Genetic, chemical | 0.65-0.81 | Cell annotation, multi-omic integration |
| CPA [4] | Autoencoder | Chemical, combinations | 0.58-0.72 | Drug combination prediction |
| GEARS [4] | Graph-enhanced MLP | Genetic | 0.63-0.78 | Genetic interaction mapping |
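Because the table reports accuracy as Pearson R, it may help to see how such a score is typically computed. The source does not state whether the correlations were taken on raw post-perturbation expression or on changes relative to control, so the sketch below shows the change-based variant as one plausible choice; all names and numbers are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def delta_pearson(predicted, observed, control):
    """Correlate predicted and observed expression changes relative to control."""
    d_pred = [p - c for p, c in zip(predicted, control)]
    d_obs = [o - c for o, c in zip(observed, control)]
    return pearson_r(d_pred, d_obs)


# Toy per-gene mean expression vectors (made-up numbers).
control = [1.0, 2.0, 3.0, 4.0]
observed = [2.0, 2.0, 1.0, 4.0]
predicted = [1.5, 2.0, 2.0, 4.0]
r_delta = delta_pearson(predicted, observed, control)
r_raw = pearson_r(predicted, observed)
```

Correlating changes rather than raw profiles avoids rewarding a model merely for reproducing the control profile, which is one reason reported scores can differ widely between studies.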
In pharmaceutical research, perturbation models are increasingly used to identify novel therapeutic applications for existing compounds and to understand their mechanisms of action [4] [2]. For example, LPM has demonstrated the ability to cluster pharmacological inhibitors with genetic perturbations targeting the same genes, effectively mapping compound-CRISPR relationships in a unified latent space [4]. This approach identified the anti-inflammatory properties of pravastatin, which clustered near non-steroidal anti-inflammatory drugs in the perturbation space—a finding corroborated by clinical observations [4]. Similarly, scGPT-enabled analysis of tumor-associated macrophages identified C5aR1 gene expression as a key modulator of PARP inhibitor resistance in breast cancer models, suggesting promising therapeutic targets [2].
Objective: Establish a standardized environment for scFM-based perturbation analysis using containerized solutions to ensure reproducibility across research teams [5].
Materials:
Procedure:
pip install biollm
Data Preprocessing:
pp.normalize_total
Model Initialization:
Objective: Simulate transcriptional responses to genetic and chemical perturbations using the Large Perturbation Model architecture [4].
Materials:
Procedure:
Model Inference:
Result Interpretation:
Objective: Compare performance across different scFMs for perturbation prediction tasks using standardized evaluation metrics [5].
Materials:
Procedure:
Model Comparison:
Results Analysis:
Table: Research Reagent Solutions for scFM Perturbation Modeling
| Resource | Type | Function | Example Implementation |
|---|---|---|---|
| BioLLM [5] | Software Framework | Standardized API for multiple scFMs | Unified interface for scGPT, Geneformer, scBERT |
| CELLxGENE [1] | Data Repository | Curated single-cell datasets | >100 million standardized cells for model training |
| LPM [4] | Specialized Model | Multi-modal perturbation prediction | PRC-disentangled architecture for cross-context prediction |
| scvi-tools [2] | Analysis Suite | Probabilistic modeling of single-cell data | Differential expression, dimensionality reduction |
| TabPFN [7] | Tabular Foundation Model | Small-sample tabular predictions | Bayesian inference for experimental design |
| Self-GenomeNet [3] | SSL Method | Genomic sequence representation | Reverse-complement aware pre-training |
The integration of foundation models into biological research requires both computational resources and specialized knowledge. For researchers beginning with scFMs, starting with user-friendly frameworks like BioLLM provides immediate access to multiple models through standardized APIs, eliminating architectural inconsistencies and simplifying benchmarking [5]. When designing perturbation studies, careful consideration of model selection is crucial—encoder-based models like Geneformer excel at gene-level tasks and network inference, while decoder-based models like scGPT demonstrate stronger performance in cell-level predictions and batch effect correction [5].
Data quality remains paramount for successful perturbation modeling. Researchers should prioritize datasets with appropriate controls, sufficient replication, and minimal technical artifacts. For novel therapeutic applications, integration across multiple evidence streams—including foundation model predictions, electronic health records, and experimental validation—creates a compelling case for candidate targets [4] [2]. As these technologies mature, the scientific community is developing standards for reporting model predictions and establishing benchmarks for methodological comparisons, further accelerating the adoption of foundation models in biological discovery and drug development.
Single-cell foundation models (scFMs) represent a transformative approach in computational biology, drawing direct inspiration from large language models (LLMs) in natural language processing (NLP). The core concept involves reframing cellular biology as a linguistic system, where individual cells are treated as "sentences" and the genes within them as "words" or "tokens" [8]. This analogy allows researchers to apply the powerful transformer architecture, which has revolutionized machine understanding of human language, to decipher the complex "language" of cellular function and state [8]. This paradigm shift is particularly impactful for in silico perturbation modeling, where the goal is to predict how targeted genetic interventions might alter cellular states, potentially accelerating therapeutic discovery [9].
Tokenization is the critical first step that converts raw, non-sequential gene expression data into a structured format that transformer models can process. Unlike words in a sentence, genes have no inherent order, requiring scFMs to implement specific strategies to impose sequence [8].
[CELL] tokens to represent cell-level information, or modality indicators for multi-omics data [8].
Most scFMs are built on the transformer architecture, which uses self-attention mechanisms to weigh the importance of all genes (tokens) when processing the information of each individual gene [8]. Two primary architectural variants are employed:
Table 1: Overview of Prominent Single-Cell Foundation Models
| Model Name | Architecture Type | Primary Pretraining Task | Input Gene Count | Key Differentiating Feature |
|---|---|---|---|---|
| Geneformer [10] | Encoder | Masked Gene Modeling (Gene ID prediction) | 2,048 (ranked) | Uses ranked gene lists; lookup table for gene embeddings. |
| scGPT [10] | Encoder (with masking) | Iterative Masked Gene Modeling (Value prediction) | ~1,200 (HVGs) | Value binning; multi-omics capability; generative pretraining. |
| scFoundation [10] | Asymmetric Encoder-Decoder | Read-depth-aware Masked Gene Modeling | ~19,000 | Uses nearly the full transcriptome; value projection. |
| UCE [10] | Encoder | Binary prediction of gene expression | 1,024 (genomic position) | Uses protein sequence embeddings from ESM-2. |
The diagram below illustrates the core workflow of how a single cell's data flows through a typical scFM based on the transformer architecture.
Figure 1: From Cell to Embedding: The Core scFM Workflow. This diagram visualizes the process of converting a cell's gene expression profile into a unified latent representation via tokenization and transformer layers.
In silico perturbation (ISP) is a premier application of scFMs, enabling the prediction of a cell's transcriptional state after a hypothetical genetic manipulation (e.g., gene knockout or overexpression).
Open-loop ISP is the baseline method: it predicts perturbation effects without incorporating prior experimental perturbation data into model fine-tuning [9].
Model Fine-Tuning for State Classification:
Perturbation Simulation and Prediction:
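The simulation step can be sketched for a rank-tokenized cell. One commonly described approach for rank-based models is to delete a gene's token to mimic a knockout, or move it to the front of the sequence to mimic overexpression, and then measure how the cell embedding shifts toward a target state. The code below is a toy under stated assumptions: `toy_embed` is a stand-in for a fine-tuned model's embedding function, and the gene vectors and target state are invented for illustration.

```python
import math


def knock_out(tokens, gene):
    """Simulate a knockout by removing the gene's token."""
    return [t for t in tokens if t != gene]


def overexpress(tokens, gene):
    """Simulate overexpression by moving the gene to the top rank."""
    return [gene] + [t for t in tokens if t != gene]


def toy_embed(tokens, gene_vectors):
    """Stand-in for a model: rank-weighted mean of per-gene vectors."""
    if not tokens:
        raise ValueError("cannot embed an empty token list")
    dim = len(next(iter(gene_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for rank, gene in enumerate(tokens):
        weight = 1.0 / (rank + 1)  # higher-ranked genes count more
        for i, value in enumerate(gene_vectors[gene]):
            total[i] += weight * value
        weight_sum += weight
    return [t / weight_sum for t in total]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def shift_toward(tokens, perturbed_tokens, target_state, gene_vectors):
    """Change in similarity to the target state caused by the perturbation."""
    before = cosine(toy_embed(tokens, gene_vectors), target_state)
    after = cosine(toy_embed(perturbed_tokens, gene_vectors), target_state)
    return after - before


# Invented two-dimensional gene vectors and target state.
gene_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
tokens = ["A", "B", "C"]
target_state = [0.0, 1.0]
shift = shift_toward(tokens, knock_out(tokens, "A"), target_state, gene_vectors)
```

A positive `shift` means the simulated knockout moves the cell toward the target state; ranking all genes by this score yields the candidate list that is later compared against experimental screens.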
Table 2: Performance of Open-Loop ISP vs. Differential Expression (DE) for T-cell Activation [9]
| Method | Positive Predictive Value (PPV) | Negative Predictive Value (NPV) | Sensitivity | Specificity |
|---|---|---|---|---|
| Open-Loop ISP | 3% | 98% | 48% | 60% |
| Differential Expression (DE) | 3% | 78% | 40% | 50% |
| ISP & DE Overlap | 7% | - | - | - |
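The four metrics in Table 2 all derive from a confusion matrix that compares predicted hits against experimentally validated hits. The sketch below shows the arithmetic on made-up gene sets; the function name and the toy gene identifiers are illustrative only.

```python
def screen_metrics(predicted_hits, true_hits, all_genes):
    """Compute PPV, NPV, sensitivity, and specificity for a hit list.

    predicted_hits: genes the method calls as hits.
    true_hits: genes confirmed as hits by experiment.
    all_genes: every gene that was tested.
    """
    predicted = set(predicted_hits)
    true = set(true_hits)
    universe = set(all_genes)
    tp = len(predicted & true)
    fp = len(predicted - true)
    fn = len(true - predicted)
    tn = len(universe - predicted - true)

    def ratio(numerator, denominator):
        # Undefined ratios are reported as NaN rather than raising.
        return numerator / denominator if denominator else float("nan")

    return {
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Toy screen: 10 genes, 2 true hits, 3 predicted hits (1 correct).
all_genes = [f"g{i}" for i in range(10)]
metrics = screen_metrics({"g1", "g2", "g3"}, {"g0", "g1"}, all_genes)
```

When true hits are rare, PPV can stay low even if sensitivity and specificity look moderate, because false positives drawn from the large pool of non-hits outnumber the true positives.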
The closed-loop protocol iteratively incorporates experimental perturbation data into fine-tuning to significantly enhance prediction accuracy, creating a "virtuous cycle" of model improvement [9].
Figure 2: The Closed-Loop Framework for Iterative Model Improvement. This workflow demonstrates how integrating experimental perturbation data creates a feedback loop that enhances the scFM's predictive accuracy.
Despite their promise, critical benchmarking studies reveal that the performance of scFMs, particularly for perturbation prediction, must be rigorously evaluated against simpler baselines.
A 2025 benchmark study compared five scFMs and two other deep learning models against simple linear models for predicting transcriptome changes after single or double genetic perturbations [11]. The findings were sobering:
Table 3: Key Findings from Benchmarking scFMs on Perturbation Prediction [11]
| Benchmark Scenario | Top Performing Model(s) | Key Implication |
|---|---|---|
| Double Gene Perturbation | Additive Linear Model (Baseline) | Current scFMs fail to capture non-additive genetic interactions better than a simple heuristic. |
| Unseen Single Gene Perturbation | Mean Prediction (Baseline); Linear Model with Perturbation Data | Pretraining on single-cell atlases offers less predictive power than pretraining on perturbation data itself. |
| Use of Model Embeddings | Linear Model using scGPT/scFoundation Gene Embeddings | Pretrained embeddings contain valuable biological knowledge, but may be better utilized by simpler models. |
The closed-loop framework has shown tangible success in a real disease context. Researchers applied it to RUNX1-Familial Platelet Disorder (RUNX1-FPD), a rare blood disorder [9]. After fine-tuning Geneformer on hematopoietic stem cells (HSCs) with RUNX1 loss-of-function, closed-loop ISP identified 14 high-confidence gene targets whose perturbation could shift diseased cells toward a healthy state. This led to the identification of several candidate therapeutic pathways, including mTOR and protein kinase C, and illustrates the potential of scFMs to accelerate drug discovery for rare diseases where samples are scarce [9].
Table 4: Key Research Reagent Solutions for scFM-Based Perturbation Studies
| Item / Reagent | Function in scFM Research |
|---|---|
| Public Cell Atlas Data (e.g., CZ CELLxGENE) [8] | Provides the large-scale, diverse single-cell datasets required for pretraining scFMs. Serves as a source of healthy/diseased reference data. |
| Perturb-seq / CRISPR Screens [9] | Generates the essential ground-truth dataset of single-cell transcriptomes following experimental genetic perturbations. Critical for closed-loop fine-tuning. |
| High-Quality scRNA-seq Datasets | Used for the initial fine-tuning of scFMs to learn the transcriptional signatures of specific biological states (e.g., T-cell activation, disease model vs. control). |
| Engineered Cell Models [9] | Provides a controlled system for modeling genetic diseases (e.g., RUNX1-FPD) and validating in silico perturbation predictions. |
| GPU Computing Clusters | Provides the necessary computational power for the fine-tuning and inference of large transformer models, which is computationally intensive [8]. |
Single-cell foundation models (scFMs) represent a revolutionary convergence of deep learning and computational biology, with transformer architectures at their core. These models fundamentally reinterpret cellular biology by treating individual cells as "sentences" and genes or genomic features as "words" or "tokens" [12]. This conceptual framework allows researchers to analyze cellular heterogeneity and complex regulatory networks using the same architectural principles that have revolutionized natural language processing. The adaptation of transformer models to single-cell genomics addresses a critical need for unified frameworks capable of integrating and comprehensively analyzing rapidly expanding biological data repositories, which now encompass tens of millions of single-cell omics datasets spanning diverse tissues, species, and biological conditions [12] [13].
The core innovation lies in applying self-supervised learning to vast single-cell datasets, enabling models to capture fundamental biological principles that generalize across diverse downstream tasks. Unlike traditional single-task models, scFMs leverage transformer architectures to incorporate diverse omics data—including single-cell RNA sequencing (scRNA-seq), single-cell ATAC sequencing (scATAC-seq), multiome sequencing, spatial transcriptomics, and proteomics—extracting latent patterns at both cell and gene/feature levels [12]. This architectural foundation has enabled breakthroughs in cross-species cell annotation, in silico perturbation modeling, and gene regulatory network inference, representing a paradigm shift toward scalable, generalizable frameworks capable of unifying diverse biological contexts [13].
The transformative capability of scFMs originates from the attention mechanism, which enables models to dynamically weight the importance of different genes when making predictions about cellular states. The mechanism operates through three fundamental components: query, key, and value vectors.
These components are derived from the same input gene embeddings through learned linear transformations, allowing the model to project genomic data into spaces where biological relationships become computationally apparent [14]. The attention weights are calculated through scaled dot-product operations, followed by softmax normalization to convert similarity scores into probabilities that highlight the most important genetic relationships for any given cellular context [14].
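The computation just described, softmax(QKᵀ / √d_k) V, can be written out in a few lines. The sketch below is a single-head, pure-Python version for readability; the tiny gene embeddings and identity projection matrices are placeholders chosen here, whereas a real model learns the projections and uses many heads.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: one row per gene token (the input gene embeddings).
    w_q, w_k, w_v: projection matrices for queries, keys, and values.
    Returns (output, weights); weights[i][j] is how much gene i attends to gene j.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


# Three toy gene embeddings; identity projections stand in for learned ones.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(x, identity, identity, identity)
```

Each row of `weights` sums to one, so every gene's output is a weighted average of the value vectors of all genes in the cell.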
Advanced scFMs employ multi-head attention, which operates like a team of biological experts analyzing the same cellular data from different perspectives. Each attention "head" independently focuses on distinct biological relationships—such as regulatory dynamics, functional pathways, or co-expression patterns—with their outputs merged to form rich, nuanced cellular representations [14]. This architectural approach enables models to capture diverse relationship types simultaneously, making them robust to biological variability and complexity [14].
Unlike text, single-cell data has no natural sequential order, so transformers must be adapted through deterministic gene-ordering strategies. Common approaches include ranking genes by expression level within each cell or partitioning genes into expression-value bins, which creates artificial "sentences" from fundamentally non-sequential data [12]. Positional encoding schemes are then adapted to represent the relative order or rank of each gene in the cell, preserving information about expression hierarchies [12].
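The expression-binning alternative can be sketched as follows. This is one simple scheme, assumed here for illustration: zero counts get bin 0, and nonzero values are assigned to equal-frequency bins computed within the cell. The function name and bin count are hypothetical, and published models differ in how they place bin edges.

```python
import bisect


def bin_expression(expression, n_bins=5):
    """Map each gene's value to an integer bin computed within one cell.

    expression: dict mapping gene symbol -> expression value.
    Zero (or negative) values map to bin 0; nonzero values map to bins
    1..n_bins according to their rank among the cell's nonzero values.
    """
    nonzero = sorted(v for v in expression.values() if v > 0)
    n = len(nonzero)

    def bin_of(value):
        if value <= 0 or n == 0:
            return 0
        rank = bisect.bisect_right(nonzero, value)  # 1..n
        return (rank * n_bins + n - 1) // n  # integer ceiling of rank * n_bins / n

    return {gene: bin_of(value) for gene, value in expression.items()}


# Toy cell with made-up values spanning a wide range.
cell = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 4.0, "E": 8.0, "F": 16.0}
bins = bin_expression(cell, n_bins=5)
coarse = bin_expression(cell, n_bins=2)
```

Because bins are computed per cell, the same bin index means "relatively high in this cell" rather than a fixed absolute count, which reduces sensitivity to sequencing depth.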
scFMs demonstrate significant architectural diversity while maintaining core transformer principles:
Table: Architectural Variations in Single-Cell Foundation Models
| Model Type | Architecture | Tokenization Approach | Gene Ranking Method | Notable Examples |
|---|---|---|---|---|
| Encoder-based | Bidirectional transformer | Gene-level tokens | Expression magnitude ranking | scBERT, Geneformer |
| Decoder-based | Autoregressive transformer | Natural language tokenization | Rank-based sequencing | cell2sentence (C2S) |
| Hybrid | Transformer with specialized components | Combined gene and metadata tokens | Multi-factor ranking | scGPT, scPlantFormer |
Most scFMs use variants of the transformer architecture configured with different attention head counts, layer depths, and hidden dimension sizes [12]. Encoder-based models like scBERT employ bidirectional attention to capture genomic context from both "directions" simultaneously, while decoder-based models like cell2sentence leverage autoregressive approaches that generate gene sequences sequentially [15]. Emerging hybrid architectures incorporate specialized components for spatial relationships, phylogenetic constraints, or multimodal integration [13].
Purpose: To predict transcriptional responses to genetic perturbations and iteratively improve prediction accuracy through experimental feedback.
Background: In silico perturbation (ISP) modeling enables researchers to simulate how cells respond to genetic manipulations without costly wet-lab experiments. The "closed-loop" approach significantly enhances prediction accuracy by incorporating experimental perturbation data during model fine-tuning [9].
Materials:
Procedure:
Troubleshooting:
Purpose: To extract biologically interpretable decision-making circuits from scFMs, connecting model internal mechanisms to known biological pathways.
Background: A significant challenge in scFMs is the "black box" nature of their predictions. Transcoder-based circuit analysis resolves the polysemanticity problem—where individual model components encode multiple biological concepts simultaneously—by decomposing transformer operations into interpretable components [15].
Materials:
Procedure:
Troubleshooting:
Recent systematic benchmarking reveals critical insights into scFM capabilities and limitations, particularly for perturbation prediction tasks:
Table: Benchmarking Results for Perturbation Effect Prediction
| Model/Approach | Double Perturbation Prediction Error (L2 Distance) | Single Perturbation Prediction | Genetic Interaction Detection | Computational Efficiency |
|---|---|---|---|---|
| Simple Additive Baseline | Reference performance | Varies by dataset | Not applicable | Most efficient |
| No Change Baseline | Higher than additive | Outperformed by linear models | Limited to buffering interactions | Most efficient |
| scGPT | Higher than baselines | Comparable to linear models | Poor (mostly buffering) | Moderate |
| Geneformer | Higher than baselines | Below linear models | Poor | Moderate |
| scBERT | Highest among benchmarks | Below linear models | Poor | Less efficient |
| Linear Model with Pretrained Embeddings | N/A | Best performance | Varies | Efficient |
Notably, current scFMs generally do not outperform deliberately simple baselines for perturbation effect prediction, particularly in zero-shot settings where models must generalize without task-specific fine-tuning [11] [16] [17]. The additive baseline model, which simply sums individual logarithmic fold changes for double perturbations, consistently outperforms or matches complex foundation models across multiple benchmarks [11]. Similarly, simple linear models using pretrained perturbation embeddings outperform foundation models for predicting effects of unseen single perturbations [11].
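The additive baseline is simple enough to state in code. The sketch below sums the log fold changes of two single perturbations to predict the double perturbation, and compares its L2 error against a no-change baseline. The gene names and values are invented for illustration and do not come from the benchmark.

```python
import math


def additive_prediction(lfc_a, lfc_b):
    """Predict the double-perturbation log fold change as the sum of the singles."""
    return {gene: lfc_a[gene] + lfc_b[gene] for gene in lfc_a}


def l2_distance(predicted, observed):
    """Euclidean distance between predicted and observed per-gene values."""
    return math.sqrt(sum((predicted[g] - observed[g]) ** 2 for g in observed))


# Made-up log fold changes relative to control.
lfc_a = {"G1": 1.0, "G2": -0.5, "G3": 0.0}
lfc_b = {"G1": 0.5, "G2": 0.0, "G3": 2.0}
observed_ab = {"G1": 1.5, "G2": -0.5, "G3": 1.0}

additive = additive_prediction(lfc_a, lfc_b)
no_change = {gene: 0.0 for gene in observed_ab}

error_additive = l2_distance(additive, observed_ab)
error_no_change = l2_distance(no_change, observed_ab)

# Residual from additivity: nonzero entries suggest a genetic interaction.
interaction = {g: observed_ab[g] - additive[g] for g in observed_ab}
```

In this toy case the only deviation from additivity is at `G3`, where the observed effect is smaller than the sum of the singles. A model has to predict residuals like this one to beat the additive baseline.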
Performance evaluations across multiple domains reveal distinct model strengths and trade-offs:
These findings highlight that model selection must be guided by specific application requirements rather than assuming general superiority of foundation models over simpler approaches.
Table: Essential Research Reagents and Computational Tools for scFM Research
| Resource Name | Type | Primary Function | Application Context |
|---|---|---|---|
| BioLLM | Standardized Framework | Unified interface for diverse scFMs | Model benchmarking and deployment [5] |
| PertEval-scFM | Benchmarking Framework | Standardized evaluation of perturbation predictions | Model performance validation [16] [17] |
| CZ CELLxGENE | Data Repository | Unified access to annotated single-cell datasets | Pretraining data sourcing [12] |
| DISCO | Data Atlas | Aggregated single-cell data for federated analysis | Multimodal data integration [13] |
| cell2sentence (C2S) | Pre-trained Model | Decoder-based scFM with biological literature training | Interpretability studies [15] |
| Geneformer | Pre-trained Model | Encoder-based scFM with focus on gene relationships | Gene-level tasks [5] |
| scGPT | Pre-trained Model | Large-scale transformer supporting multi-omic tasks | General-purpose applications [5] |
While transformer-based scFMs represent a significant architectural advancement in computational biology, substantial challenges remain. Current models face limitations in perturbation effect prediction, often failing to outperform simple linear baselines [11] [16]. The interpretability gap persists despite advances in mechanistic interpretability techniques [15], and batch effects continue to complicate cross-study integration [5].
Future developments will likely focus on specialized architectures for perturbation modeling, improved multimodal integration strategies, and more biologically-grounded benchmarking frameworks. The emergence of closed-loop approaches that iteratively incorporate experimental feedback demonstrates promising pathways for enhancing predictive accuracy [9]. As the field matures, standardized evaluation frameworks like BioLLM [5] and PertEval-scFM [16] [17] will be crucial for directing methodological progress toward biologically meaningful improvements rather than purely algorithmic advancements.
The integration of transformer architectures with single-cell genomics has substantially expanded the scale and scope of computational biological analysis. Through continued architectural innovation and rigorous biological validation, scFMs have the potential to evolve from powerful pattern-recognition tools into genuinely predictive in silico models of cellular behavior.
The development of robust single-cell foundation models (scFMs) for in silico perturbation modeling is fundamentally constrained by the scale, diversity, and quality of the data used for their pretraining. A foundation model is a large-scale deep learning model pretrained on vast datasets through self-supervised learning, which enables it to be adapted to a wide range of downstream tasks [1]. The premise of scFMs is that by exposing a model to millions of cells encompassing diverse tissues, species, and conditions, it can learn fundamental, generalizable principles of cellular identity and function [1]. For perturbation modeling, this extensive pretraining is critical, as it allows the model to internalize a representation of the "normal" cellular state space against which the effects of genetic or chemical perturbations can be predicted. The success of models like scGPT, pretrained on over 33 million cells, demonstrates the power of this approach [18]. This protocol details the data sources and methodologies for constructing a comprehensive pretraining corpus tailored for scFMs focused on perturbation biology.
Assembling a pretraining dataset requires leveraging multiple public repositories that host and standardize single-cell data. The table below summarizes the key resources, their primary content, and quantitative metrics relevant for scFM development.
Table 1: Key Public Repositories for Single-Cell and Perturbation Data
| Repository Name | Primary Content & Specialization | Reported Scale (Cells / Datasets) | Notable Features for Perturbation Modeling |
|---|---|---|---|
| CZ CELLxGENE [1] | Curated single-cell census data; multi-species, multi-tissue | Over 100 million cells [1] | Unified access to annotated datasets; standardized for analysis |
| Human Cell Atlas (HCA) [19] | Multi-omic, community-generated open data | 70.3 million cells; 523 projects; 11.2k donors [19] | Aims for a comprehensive reference map of all human cells |
| PerturbSeq.db [20] | Curated single-cell perturbation datasets | 189 datasets (165 scRNA-seq, 24 scATAC-seq) from 77 studies [20] | Dedicated to genetic (CRISPR) and chemical perturbation data |
| Expression Atlas [21] | Bulk and single-cell gene expression under different conditions | Not reported | Provides differential expression data across diverse biological conditions |
| DISCO [18] | Single-cell omics data browser and repository | Aggregates over 100 million cells [18] | Supports federated analysis across multiple data sources |
| Gene Expression Omnibus (GEO) / SRA [1] | Primary archive for high-throughput sequencing data | Thousands of single-cell studies [1] | Raw, primary data; requires significant processing and curation |
This protocol outlines a systematic procedure for building a large-scale, high-quality pretraining dataset from the repositories listed above, with a specific emphasis on enabling robust in silico perturbation modeling.
Objective: To identify and select relevant datasets that maximize biological and technical diversity.
Materials: Access to the internet, computational resources for metadata handling.
Procedure:
Objective: To download selected data and perform rigorous quality control to ensure dataset integrity.
Materials: High-performance computing cluster, sufficient data storage, tools like wget or aws s3 for data transfer, and single-cell analysis toolkits (e.g., Scanpy in Python).
Procedure:
.h5ad files) where available, as this is the common format for CZ CELLxGENE and many other resources [22].
Objective: To merge the individually curated datasets into a unified, analysis-ready corpus while mitigating technical noise.
Materials: Integrated development environment (e.g., RStudio, Jupyter Notebook), single-cell integration tools (e.g., scVI, Harmony, Scanorama).
Procedure:
The following diagram illustrates the complete workflow from data discovery to a finalized pretraining corpus.
Figure 1: Workflow for building a pretraining corpus for perturbation scFMs.
The following table lists key computational tools, data resources, and platforms that constitute the essential "reagent solutions" for developing scFMs for perturbation modeling.
Table 2: Key Research Reagent Solutions for scFM Pretraining
| Item Name | Type | Primary Function in Pretraining |
|---|---|---|
| PerturbSeq.db [20] | Database | A pre-curated repository of single-cell perturbation datasets, providing ready-made data for training and benchmarking perturbation models. |
| CZ CELLxGENE / HCA [1] [19] | Data Platform | Provides the foundational "baseline" cellular data at scale, essential for teaching the model normal cellular states. |
| scGPT / scPlantFormer [18] | Foundation Model | Examples of state-of-the-art scFMs whose architectures and pretraining protocols can be adopted or adapted for new models. |
| BioLLM [18] | Software Framework | A standardized framework for integrating and benchmarking different single-cell foundation models, enabling performance comparison. |
| sysVI [18] | Computational Tool | A batch integration tool that preserves biological variation while removing technical noise, critical for data harmonization. |
| FISHscale / FISHspace [22] | Analysis Pipeline | Software for processing and analyzing spatial transcriptomics data (e.g., EEL-FISH), allowing for the inclusion of spatial context. |
A critical, iterative step in the protocol is ensuring the quality of the incoming data. The diagram below details the quality control process applied to each dataset before integration.
Figure 2: Data quality control and batch effect audit workflow.
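As a rough illustration of the per-dataset quality control step described above, the sketch below filters cells from a small raw count matrix using only the Python standard library. The function name `qc_filter`, the thresholds (`min_genes`, `max_mito_frac`), and the `MT-` prefix convention for mitochondrial genes are illustrative assumptions, not values taken from this protocol; in practice a toolkit such as Scanpy would typically handle this step.

```python
# Toy per-cell QC on a raw count matrix (rows = cells, columns = genes).
# All thresholds are illustrative assumptions, not values from the protocol.

def qc_filter(counts, gene_names, min_genes=2, max_mito_frac=0.2, mito_prefix="MT-"):
    """Return the indices of cells that pass basic QC checks."""
    mito_cols = [j for j, g in enumerate(gene_names) if g.startswith(mito_prefix)]
    kept = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            continue  # no counts at all: treat as an empty droplet
        n_genes = sum(1 for c in row if c > 0)           # genes detected in this cell
        mito_frac = sum(row[j] for j in mito_cols) / total  # mitochondrial fraction
        if n_genes >= min_genes and mito_frac <= max_mito_frac:
            kept.append(i)
    return kept


gene_names = ["CD3E", "IL2", "MT-CO1", "GAPDH"]
counts = [
    [5, 2, 1, 10],  # four genes detected, low mitochondrial fraction
    [0, 0, 9, 1],   # dominated by mitochondrial reads
    [0, 0, 0, 3],   # only one gene detected
    [0, 0, 0, 0],   # empty
]
passing = qc_filter(counts, gene_names)
print(passing)
```

Appropriate thresholds depend on tissue, chemistry, and sequencing depth, so they are usually chosen per dataset rather than fixed across a corpus.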
The construction of a high-quality, large-scale pretraining corpus is a foundational step in developing scFMs capable of accurate in silico perturbation modeling. By systematically leveraging public repositories—from specialized resources like PerturbSeq.db for perturbation data to expansive atlases like the HCA for cellular baselines—and adhering to a rigorous protocol of selection, quality control, and integration, researchers can build the robust datasets required to power the next generation of predictive models in computational biology and drug discovery.
Tokenization, the process of converting raw gene expression data into discrete, model-readable units or "tokens," is a foundational step in building single-cell foundation models (scFMs). For in silico perturbation modeling—where the goal is to computationally predict cellular responses to genetic or chemical perturbations—the choice of tokenization strategy directly impacts a model's ability to learn meaningful biological representations and generalize to unseen data. This document outlines prevalent tokenization strategies, provides quantitative comparisons, details experimental protocols for their implementation, and visualizes key workflows and pathways relevant to perturbation modeling.
In single-cell RNA sequencing (scRNA-seq) analysis, tokenization strategies define how the high-dimensional and non-sequential gene expression profile of a single cell is transformed into a structured sequence for transformer-based models [1]. The core challenge is that gene expression data lacks inherent sequence; the order of genes in a cell does not carry semantic meaning as words do in a sentence. scFMs address this by imposing a deterministic order or structure on the gene features.
Table 1: Common Tokenization Strategies for Single-Cell Foundation Models
| Strategy Name | Core Principle | Typical Model Examples | Key Advantages | Key Limitations |
|---|---|---|---|---|
| Expression-Based Ranking | Genes are ordered by their expression value within each cell, and the top-k genes form the input sequence [1]. | scGPT, Geneformer [1] | Simple, deterministic; captures most active genes. | Order is arbitrary and may not reflect biological gene-gene relationships. |
| Expression Binning | Genes are partitioned into bins (e.g., high/medium/low expression) based on their expression values, and the bin membership determines the token [1]. | scBERT [1] | Reduces vocabulary size; can capture coarse-grained expression levels. | Loss of fine-grained, continuous expression information. |
| Direct Normalized Counts | Uses normalized count values (or their log-transform) directly as input features without complex sequencing [1]. | Some scFMs [1] | Preserves full, continuous expression information. | Model must learn to handle high dimensionality and sparsity directly. |
| Convolutional Tokenization | The entire gene expression vector is segmented into fixed-size windows, and 1D-convolution is applied to generate local feature tokens [23]. | scSFUT [23] | Eliminates need for gene selection; uses full gene vector; expands attention receptive field. | Computationally intensive; less interpretable at the single-gene level. |
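To make the first two strategies in the table concrete, here is a minimal sketch of expression-based ranking and expression binning for a single cell. The function names, the tie-breaking rule, and the equal-width binning scheme are assumptions chosen for illustration; published models differ in normalization and binning details.

```python
def rank_tokenize(expression, top_k):
    """Order expressed genes by descending expression and keep the top_k."""
    expressed = [(g, v) for g, v in expression.items() if v > 0]
    # Sort by value (descending); ties are broken by gene name for determinism.
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    return [g for g, _ in expressed[:top_k]]


def bin_tokenize(expression, n_bins):
    """Map each gene to an integer bin: 0 = not expressed, 1..n_bins = equal-width bins."""
    max_v = max(expression.values())
    tokens = {}
    for g, v in expression.items():
        if v <= 0 or max_v <= 0:
            tokens[g] = 0
        else:
            # Scale to [0, n_bins], then clamp the maximum value into the top bin.
            tokens[g] = min(n_bins, int(v / max_v * n_bins) + 1)
    return tokens


# Hypothetical normalized expression values for one cell.
cell = {"GAPDH": 8.0, "CD3E": 4.0, "IL2": 0.0, "ACTB": 6.0, "RUNX1": 1.0}

rank_tokens = rank_tokenize(cell, top_k=3)
bin_tokens = bin_tokenize(cell, n_bins=3)
print(rank_tokens)  # ['GAPDH', 'ACTB', 'CD3E']
print(bin_tokens)
```

The ranking output discards magnitudes and keeps only order, while the binning output keeps a coarse magnitude for every gene; this is the trade-off the table summarizes.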
The performance of a tokenization strategy is intrinsically linked to the downstream task. For in silico perturbation (ISP) prediction, the "closed-loop" framework—which incorporates experimental perturbation data during model fine-tuning—has demonstrated significant improvements over "open-loop" approaches. The following table summarizes performance metrics from a benchmark study that utilized a Geneformer model, highlighting the impact of data integration on ISP accuracy [9].
Table 2: Performance of Open-Loop vs. Closed-Loop In Silico Perturbation Prediction in T-Cell Activation [9]
| Prediction Method | Positive Predictive Value (PPV) | Negative Predictive Value (NPV) | Sensitivity | Specificity | AUROC |
|---|---|---|---|---|---|
| Differential Expression (DE) | 3% | 78% | 40% | 50% | Not Reported |
| Open-Loop ISP | 3% | 98% | 48% | 60% | 0.63 |
| DE + Open-Loop ISP Overlap | 7% | Not Reported | Not Reported | Not Reported | Not Reported |
| Closed-Loop ISP | 9% | 99% | 76% | 81% | 0.86 |
A critical finding for practical implementation is that the performance of the closed-loop model improved dramatically with just 10 perturbation examples and approached saturation with approximately 20 examples, indicating that even modest experimental validation can substantially enhance predictive accuracy [9].
This protocol details the steps for fine-tuning a pre-trained scFM, like Geneformer, using an expression-based ranking tokenization strategy for a specific in silico perturbation task [9] [1].
Materials and Reagents:
Method Details:
Data Preprocessing and Quality Control:
log10(gexp + 1)) to stabilize variance and manage long-tailed distributions [25].
Tokenization and Input Sequencing:
Model Fine-Tuning:
In Silico Perturbation Prediction:
This protocol extends Protocol 1 by iteratively incorporating experimental data to refine the scFM, dramatically improving ISP accuracy [9].
Method Details:
Initial Model and Perturbation Screening:
Experimental Validation and Data Integration:
Closed-Loop Fine-Tuning:
The following diagrams illustrate the core closed-loop framework and a key signaling pathway identified through its application.
Closed-Loop ISP Workflow
RUNX1-FPD Signaling Pathways
Table 3: Essential Research Reagents and Computational Tools for scFM and ISP
| Item Name | Function / Application | Specific Examples / Notes |
|---|---|---|
| Pre-trained scFMs | Provides a foundational model that can be fine-tuned for specific tasks, saving computational resources. | Geneformer, scGPT, scBERT [1]. |
| Containerization Platform | Ensures computational reproducibility by encapsulating the entire software environment. | Docker [24]. |
| Integrated Pipelines | Provides pre-defined workflows for processing raw sequencing data into analyzable formats. | RumBall (for RNA-seq), bioBakery Workflows (for metagenomics) [24] [27]. |
| Data Preprocessing Tools | Performs quality control, normalization, and batch effect correction on raw count matrices. | Scanpy in Python [23]. |
| Perturbation Screening Tech | Experimentally validates in silico predictions and generates data for closed-loop learning. | CRISPRi/CRISPRa screens, Perturb-seq [9]. |
| Reference Datasets | Used for model pretraining and as benchmarks for fine-tuned models. | CZ CELLxGENE, Human Cell Atlas, TCGA, GTEx [1] [26]. |
Self-supervised learning (SSL) has emerged as a transformative approach for analyzing single-cell transcriptome data, enabling researchers to extract meaningful biological insights from vast amounts of unlabeled data. By learning representations without manual annotation, SSL methods capture complex cellular states and functions and form the foundation for advanced in silico perturbation modeling with single-cell foundation models (scFMs). This allows computational biologists to predict cellular responses to genetic and therapeutic interventions, accelerating therapeutic discovery, particularly for rare diseases where experimental data are scarce.
The power of SSL lies in its ability to leverage the intrinsic structure of single-cell RNA sequencing (scRNA-seq) data through pretext tasks that require the model to learn meaningful representations without explicit supervision. These pre-trained models can then be fine-tuned for specific downstream applications with remarkable efficiency. Within the context of scFMs research, SSL provides the essential pre-training framework that enables accurate prediction of perturbation effects, cell-type annotation, and data integration across diverse biological contexts.
Extensive benchmarking across multiple single-cell genomics datasets reveals the nuanced effectiveness of different SSL approaches. The following table summarizes key quantitative findings from large-scale studies evaluating SSL methods on millions of single cells:
Table 1: Performance comparison of self-supervised learning methods on single-cell transcriptomes
| SSL Method | Key Application | Performance Metric | Result | Reference |
|---|---|---|---|---|
| Masked Autoencoder (Random masking) | Cell-type prediction (PBMC) | Macro F1 score | 0.7466 ± 0.0057 | [28] |
| Supervised Baseline | Cell-type prediction (PBMC) | Macro F1 score | 0.7013 ± 0.0077 | [28] |
| Masked Autoencoder (GP masking) | Cell-type prediction (Tabula Sapiens) | Macro F1 score | 0.3085 ± 0.0040 | [28] |
| Supervised Baseline | Cell-type prediction (Tabula Sapiens) | Macro F1 score | 0.2722 ± 0.0123 | [28] |
| Closed-loop ISP Framework | Perturbation prediction (T-cell activation) | Positive Predictive Value | 9% (vs. 3% open-loop) | [9] |
| scPML | Cross-platform cell annotation | Accuracy | 0.87 (mean) | [29] |
| Geneformer | Cross-platform cell annotation | Accuracy | 0.72 (mean) | [29] |
Research indicates that SSL demonstrates particularly strong performance in specific biological scenarios:
Transfer learning applications: SSL pre-training on large auxiliary datasets (e.g., scTab with >20 million cells) significantly improves performance on smaller target datasets for cell-type prediction and gene-expression reconstruction [28]. Improvements are most pronounced for underrepresented cell types, as evidenced by stronger gains in macro F1 scores compared to micro F1 scores [28].
Architectural advantages: Masked autoencoders consistently outperform contrastive learning methods in single-cell genomics applications, contrary to trends observed in computer vision [28] [30]. This advantage is maintained across multiple masking strategies, including random masking and biologically-informed gene program masking.
Data efficiency: The "closed-loop" framework for perturbation modeling demonstrates that incorporating even small amounts of experimental data (10-20 perturbation examples) during fine-tuning can substantially improve prediction accuracy [9].
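The two masking strategies mentioned above (random masking and gene program masking) can be sketched as follows. This is a toy illustration under stated assumptions: masked positions are set to zero, the reconstruction loss is a mean squared error over masked positions only, and the gene program is a hypothetical three-gene set. The cited study's exact masking rate and loss are not given in this text.

```python
import random


def mask_random(values, mask_frac, rng):
    """Zero out a random subset of gene positions; return the masked vector and indices."""
    n_mask = max(1, round(mask_frac * len(values)))
    masked_idx = sorted(rng.sample(range(len(values)), n_mask))
    masked = list(values)
    for i in masked_idx:
        masked[i] = 0.0
    return masked, masked_idx


def mask_gene_program(values, gene_names, program):
    """Zero out every gene that belongs to a given gene program (a set of gene names)."""
    masked_idx = [i for i, g in enumerate(gene_names) if g in program]
    masked = [0.0 if i in masked_idx else v for i, v in enumerate(values)]
    return masked, masked_idx


def masked_mse(original, reconstructed, masked_idx):
    """Mean squared error computed only over the masked positions."""
    return sum((original[i] - reconstructed[i]) ** 2 for i in masked_idx) / len(masked_idx)


genes = ["CD3E", "IL2", "IFNG", "GAPDH", "ACTB", "RUNX1", "CD4", "CD8A"]
cell = [2.0, 0.0, 1.5, 3.0, 0.5, 0.0, 1.0, 2.5]

rng = random.Random(0)  # fixed seed so the example is reproducible
rand_masked, rand_idx = mask_random(cell, mask_frac=0.5, rng=rng)

# Hypothetical "T-cell activation" program used only for illustration.
gp_masked, gp_idx = mask_gene_program(cell, genes, {"IL2", "IFNG", "CD3E"})

# Stand-in "model": predict the mean of the visible values at every position.
visible = [v for i, v in enumerate(cell) if i not in rand_idx]
prediction = [sum(visible) / len(visible)] * len(cell)
loss = masked_mse(cell, prediction, rand_idx)
print(rand_idx, gp_idx, loss)
```

A real masked autoencoder replaces the stand-in predictor with a trained network; the masking and loss bookkeeping stay the same in spirit.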
Data Collection: Assemble a large-scale single-cell transcriptomics dataset for pre-training. The CELLxGENE census scTab dataset comprising over 20 million cells across diverse tissues and conditions serves as an ideal starting point [28]. Include all 19,331 human protein-encoding genes to maximize generalizability.
Quality Control: Apply standard scRNA-seq quality control metrics:
Normalization: Normalize gene expression values using standard scRNA-seq processing:
Network Architecture: Implement a fully connected autoencoder network with the following specifications [28]:
Pretext Task Implementation - Masked Autoencoding:
Training Specifications:
Base Model Initialization: Start with a foundation model pre-trained on large-scale single-cell data (e.g., Geneformer) [9] [11].
Task-Specific Fine-tuning:
Closed-Loop Framework Implementation [9]:
In Silico Perturbation Simulation:
Validation and Interpretation:
Table 2: Key research reagents and computational resources for SSL in single-cell transcriptomics
| Resource | Type | Function/Application | Example/Reference |
|---|---|---|---|
| scTab Dataset | Data Resource | Large-scale reference dataset for SSL pre-training; contains >20 million cells | CELLxGENE census [28] |
| Masked Autoencoder | Algorithm | SSL method for learning representations through reconstruction of masked input features | [28] [30] |
| Gene Program Annotations | Biological Knowledge | Curated gene sets for biologically-informed masking strategies | Pathway databases [29] |
| Geneformer | Foundation Model | Pre-trained transformer model for single-cell transcriptomics | [9] [11] |
| Closed-Loop Framework | Methodology | Approach for incorporating experimental data to improve perturbation predictions | [9] |
| scPML | Software Tool | Pathway-based multi-view learning for cell type annotation | [29] |
| Perturb-seq Data | Experimental Data | Single-cell CRISPR screening data for perturbation model training | [9] [11] |
Self-supervised learning represents a paradigm shift in the analysis of single-cell transcriptomes, providing a powerful framework for extracting biological insights from unlabeled data at scale. The protocols and applications outlined in this document demonstrate the tangible benefits of SSL in enhancing cell-type annotation, data integration, and—most critically—predicting cellular responses to perturbations. As the field progresses toward more sophisticated "virtual cell" models, SSL will continue to serve as the foundational element enabling accurate in silico experiments and accelerating therapeutic discovery, particularly for rare diseases where experimental data remains limited. The integration of SSL pre-training with closed-loop experimental validation creates a powerful cycle of discovery that promises to transform computational biology and drug development.
In silico perturbation (ISP) represents a transformative computational approach in cellular biology, enabling researchers to predict the effects of genetic manipulations, such as gene knockouts and overexpression, without conducting costly and time-intensive laboratory experiments. This methodology leverages single-cell foundation models (scFMs), which are large-scale deep learning models pre-trained on vast datasets comprising millions of single-cell transcriptomes [1]. These models learn fundamental principles of cellular biology and gene regulation, allowing them to be fine-tuned for specific tasks like predicting transcriptional changes following genetic perturbations [9] [31]. The core premise of ISP is the creation of "virtual cells" that can simulate cellular responses to diverse perturbations, thus accelerating biological discovery and therapeutic development, particularly for rare diseases where patient samples are scarce [9].
The workflow operates by treating individual cells as "sentences" and genes or genomic features as "words" or "tokens" [1]. Through sophisticated tokenization and embedding processes, scFMs can model the complex, high-dimensional relationships within gene regulatory networks. When a perturbation is simulated, the model predicts how the removal (knockout) or enhanced expression (overexpression) of specific genes alters the transcriptional state of the cell [9] [32]. This capability is invaluable for prioritizing gene targets for functional validation, understanding disease mechanisms, and identifying potential therapeutic interventions [9].
The ISP workflow involves a sequence of critical steps, from data preparation and model setup to the execution and validation of in silico experiments. The following diagram illustrates the logical flow and key decision points in a standard ISP pipeline.
The initial phase involves curating high-quality single-cell RNA sequencing (scRNA-seq) data, which serves as the input for the scFM. The model requires a gene-by-cell count matrix from wild-type (WT) samples [32]. A critical challenge is that gene expression data lacks inherent sequential order, unlike words in a sentence. To address this, scFMs employ various tokenization strategies to structure the data for the model:
Additional special tokens may be incorporated to provide biological context, such as cell identity metadata, modality indicators for multi-omics data, or gene ontology information [1]. Positional encoding schemes are then applied to represent the relative order or rank of each gene in the cell.
Selecting an appropriate scFM is crucial for ISP success. Current models vary in their architectures, pretraining data, and specific capabilities. The table below summarizes key models and their applications in ISP.
Table 1: Single-Cell Foundation Models for In Silico Perturbation
| Model Name | Architecture Type | Key ISP Features | Perturbation Types Supported | Notable Applications |
|---|---|---|---|---|
| Geneformer [9] [31] | Transformer-based Encoder | Predicts direction of cell state shift (e.g., toward activation or rest); Can be used in open or closed-loop modes. | Knockout, Overexpression | T-cell activation studies, RUNX1-familial platelet disorder target identification. |
| scGPT [11] [31] | GPT-like Decoder | Predicts post-perturbation transcriptomes; Can be combined with a linear decoder for perturbation tasks. | Single/double gene knockout | Benchmarking studies on CRISPRa/i datasets. |
| scTenifoldKnk [32] | Tensor-based Workflow | Constructs Gene Regulatory Networks (GRNs) from WT data; virtually deletes a gene from the GRN to identify differentially regulated genes. | Virtual knockout | Systematic virtual KO analysis; recapitulation of real-animal KO findings. |
| Large Perturbation Model (LPM) [31] | PRC-disentangled Decoder | Integrates diverse perturbation data (genetic, chemical); disentangles Perturbation, Readout, and Context dimensions. | CRISPR, Chemical compounds | Predicting outcomes of unobserved experiments, mapping compound-CRISPR shared space. |
The core of the ISP workflow involves applying the selected and configured model to simulate the genetic perturbation.
Simulating overexpression often uses similar underlying architectures as knockout simulations. The key difference lies in how the perturbation is represented to the model. Instead of removing a gene's influence, the model is instructed to predict the transcriptional consequences of elevated expression of the target gene. For example, in Geneformer, this involves inputting a command to overexpress the gene and analyzing the predicted shift in the cell's embedding within the state space [9].
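One way to picture knockout and overexpression on a rank-ordered token input is sketched below. It assumes the cell is represented as a list of gene tokens sorted by expression rank, that a knockout removes the target token, and that overexpression moves the target to the top rank. The embedding vectors are made-up numbers standing in for model outputs, and the function names are hypothetical; none of this is an actual model API.

```python
import math


def knockout(rank_tokens, gene):
    """Simulate a deletion by removing the gene's token from the ranked input."""
    return [g for g in rank_tokens if g != gene]


def overexpress(rank_tokens, gene):
    """Simulate overexpression by moving the gene's token to the top rank."""
    return [gene] + [g for g in rank_tokens if g != gene]


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def state_shift(emb_original, emb_perturbed, emb_target):
    """Positive values mean the perturbation moved the cell toward the target state."""
    return cosine(emb_perturbed, emb_target) - cosine(emb_original, emb_target)


cell_tokens = ["GAPDH", "ACTB", "CD3E", "RUNX1"]
ko_tokens = knockout(cell_tokens, "CD3E")
oe_tokens = overexpress(cell_tokens, "RUNX1")

# Made-up 2-D embeddings standing in for a model's output before and after perturbation.
shift = state_shift(emb_original=[1.0, 0.0], emb_perturbed=[1.0, 1.0], emb_target=[0.0, 1.0])
print(ko_tokens, oe_tokens, round(shift, 3))
```

In a real workflow the original and perturbed token lists would each be passed through the fine-tuned model to obtain embeddings, and the shift would be compared across many cells and candidate genes.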
A significant advancement in ISP is the "closed-loop" framework, which iteratively improves model predictions by incorporating experimental data [9]. The process is as follows:
This framework has been shown to increase the Positive Predictive Value (PPV) of ISP three-fold, from 3% to 9%, while also improving sensitivity and specificity. Notably, performance improvements can saturate with as few as 20 experimental perturbation examples incorporated during fine-tuning [9].
Rigorous benchmarking is essential to assess the predictive power and limitations of ISP methods. A critical finding from recent large-scale benchmarks is that the performance of complex deep learning models must be compared against deliberately simple baselines [11].
A comprehensive evaluation of ISP models requires multiple metrics to capture different aspects of performance, as summarized in the table below.
Table 2: Key Metrics for Evaluating In Silico Perturbation Predictions
| Metric | Definition | Interpretation in ISP Context | Key Findings from Recent Studies |
|---|---|---|---|
| L2 Distance / R² | Measures the overall agreement between predicted and observed gene expression values. | Assesses general transcriptome-wide prediction accuracy. | High R² does not guarantee good performance in identifying biologically significant changes [33]. Complex models do not consistently outperform simple additive baselines for double perturbation prediction [11]. |
| Area Under the Precision-Recall Curve (AUPRC) | Evaluates the precision and recall of identifying Differentially Expressed (DE) genes. | Directly measures the ability to detect biologically relevant, perturbed genes. | Models with high R² can have low AUPRC, highlighting the metric's importance for biologically relevant assessment [33]. |
| Positive Predictive Value (PPV) | The proportion of predicted positive effects that are true positives. | Indicates the reliability of a predicted hit (e.g., a gene that shifts cell state). | Open-loop ISP had a PPV of 3% for T-cell activation, which increased to 9% with closed-loop fine-tuning [9]. |
| Sensitivity / Recall | The proportion of true positives correctly identified. | Measures the model's ability to find all relevant hits. | Improved from 48% (open-loop) to 76% (closed-loop) in T-cell activation studies [9]. |
| Specificity | The proportion of true negatives correctly identified. | Measures the model's ability to rule out non-hits. | Improved from 60% (open-loop) to 81% (closed-loop) in T-cell activation studies [9]. |
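The hit-calling metrics in the table (PPV, NPV, sensitivity, specificity) all follow from a confusion matrix over the screened genes. The sketch below computes them from sets of predicted and experimentally confirmed hits; the gene labels and counts are invented for illustration and are unrelated to the reported T-cell results.

```python
def screen_metrics(predicted_hits, true_hits, all_genes):
    """Compute PPV, NPV, sensitivity, and specificity for a perturbation screen."""
    predicted_hits, true_hits, all_genes = set(predicted_hits), set(true_hits), set(all_genes)
    tp = len(predicted_hits & true_hits)               # predicted hit, confirmed hit
    fp = len(predicted_hits - true_hits)               # predicted hit, not confirmed
    fn = len(true_hits - predicted_hits)               # missed hit
    tn = len(all_genes - predicted_hits - true_hits)   # correctly ruled out

    def ratio(num, den):
        return num / den if den else None  # undefined when the denominator is zero

    return {
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


all_genes = [f"g{i}" for i in range(10)]  # hypothetical screened genes
true_hits = {"g0", "g1"}                   # confirmed experimentally (made up)
predicted_hits = {"g0", "g2", "g3"}        # called by the model (made up)

metrics = screen_metrics(predicted_hits, true_hits, all_genes)
print(metrics)
```

Because true hits are typically rare in genome-scale screens, NPV tends to be high and PPV low even for useful models, which is why the two are reported together.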
Benchmarking studies have established that simple models provide a crucial baseline for evaluation [11]:
A landmark study found that none of the five tested foundation models and two other deep learning models outperformed these simple baselines in predicting transcriptome changes after double perturbations [11]. This underscores the importance of critical benchmarking and the need for continued method development.
This protocol details the application of the closed-loop ISP framework for target discovery in RUNX1-Familial Platelet Disorder (RUNX1-FPD), a rare hematologic disease [9].
This protocol is designed for predicting genetic interactions and the effects of combinatorial gene perturbations [11] [32].
Successful implementation of ISP workflows relies on a combination of computational tools, biological data, and experimental reagents. The following table catalogues essential resources for the field.
Table 3: Essential Research Reagents and Resources for In Silico Perturbation
| Category | Item / Resource | Specifications / Example | Function in ISP Workflow |
|---|---|---|---|
| Computational Tools & Models | Geneformer [9] [31] | A transformer model pre-trained on millions of single-cell transcriptomes. | Fine-tuned for predicting direction of cell state change upon perturbation. |
| Computational Tools & Models | scGPT [11] [31] | A generative pre-trained transformer model for single-cell biology. | Predicts high-dimensional transcriptome changes after genetic perturbations. |
| Computational Tools & Models | scTenifoldKnk [32] | A machine learning workflow for virtual KO using tensor decomposition and manifold alignment. | Performs virtual KO analysis using only WT scRNA-seq data to infer gene function. |
| Computational Tools & Models | Large Perturbation Model (LPM) [31] | A decoder-only model that disentangles Perturbation, Readout, and Context. | Integrates diverse perturbation data types (genetic, chemical) for outcome prediction. |
| Data Resources | CZ CELLxGENE [1] | A platform providing unified access to over 100 million annotated single cells. | Source of diverse, high-quality scRNA-seq data for model pre-training and fine-tuning. |
| Data Resources | Perturb-seq Datasets [11] [9] | e.g., Norman et al., Replogle et al. | Provides ground-truth scRNA-seq data from genetic screens for model training and benchmarking. |
| Experimental Reagents (for Validation) | CRISPR Activation/Interference (CRISPRa/i) | e.g., dCas9-VPR, dCas9-KRAB systems. | For experimental validation of ISP predictions via targeted gene overexpression or knockdown. |
| Experimental Reagents (for Validation) | Primary Human T-cells [9] | Isolated from healthy donors. | A biologically relevant system for validating ISP predictions related to immune activation. |
| Experimental Reagents (for Validation) | Engineered Human HSCs [9] | e.g., RUNX1-knockout models of RUNX1-FPD. | A disease model for validating ISP-predicted therapeutic targets in a rare genetic disorder. |
The In Silico Perturbation workflow, powered by single-cell foundation models, provides a powerful and scalable framework for simulating genetic knockouts and overexpression. While current models show promise, benchmarking reveals that their performance against simple baselines requires careful evaluation [11]. The adoption of a closed-loop framework, which incorporates experimental data into model fine-tuning, significantly enhances prediction accuracy and represents a crucial step toward realizing the potential of "virtual cell" models for biomedical discovery [9]. As models evolve and integrate more diverse data types [31], ISP is poised to become an indispensable tool for functional genomics and therapeutic target identification.
The ability to accurately predict how a cell will respond to a genetic or chemical perturbation represents a significant unsolved challenge in biology with profound implications for understanding disease mechanisms and accelerating therapeutic development. Single-cell foundation models (scFMs) have emerged as powerful deep learning tools pre-trained on vast amounts of single-cell transcriptomics data, enabling in silico perturbation (ISP) predictions that simulate cellular responses without extensive experimental validation [9] [1]. These models represent an important step toward creating "virtual cells" that can simulate cellular responses to diverse perturbations, holding particular value for rare diseases where patient samples are scarce and experimental screening is challenging [9].
However, current "open-loop" scFMs face a critical limitation: while they generate predictions that can be experimentally tested, they cannot learn from these experiments to create better predictions [9]. This open-loop approach leaves a significant gap between computational prediction and experimental validation. Closing this loop represents a crucial step toward realizing the full potential of virtual cell models for biomedical discovery. This protocol details the methodology for implementing a closed-loop framework that extends scFMs by incorporating experimental perturbation data during model fine-tuning, substantially improving prediction accuracy and biological relevance [9].
Single-cell foundation models typically employ transformer-based architectures that learn from massive single-cell datasets through self-supervised pretraining [1]. In these models, individual cells are treated analogously to sentences, and genes or genomic features along with their expression values are treated as words or tokens [1]. The model learns fundamental principles of cellular organization that can be generalized to new datasets or downstream tasks through attention mechanisms that weight relationships between gene tokens [1].
Two predominant architectural approaches have emerged:
For perturbation prediction, these models are typically fine-tuned on specific cellular states and then used to simulate the effects of genetic perturbations such as gene knockouts or overexpression [9].
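The attention mechanism mentioned above, which weights relationships between gene tokens, can be illustrated with a minimal single-head scaled dot-product attention in plain Python. The two-dimensional token vectors are arbitrary toy values, and real scFMs add learned projections, multiple heads, and many layers; this is only a sketch of the core computation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention; returns outputs and attention weights."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query token to every key token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))])
    return outputs, weights


# Toy embeddings for three gene tokens (arbitrary numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one and shows how strongly one gene token attends to every other token in the same cell.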
Despite their theoretical promise, critical benchmarking studies reveal significant limitations in current scFMs for perturbation prediction. A comprehensive assessment published in Nature Methods demonstrated that five foundation models and two other deep learning models failed to outperform deliberately simple baselines for predicting transcriptome changes after single or double perturbations [11]. The simple "additive" model that predicts the sum of individual logarithmic fold changes consistently outperformed more complex deep learning approaches [11].
Similarly, the PertEval-scFM benchmarking framework found that scFM embeddings offer limited improvement over simple baseline models in zero-shot settings, particularly under distribution shift [16]. These findings highlight the ongoing challenges in perturbation effect prediction and underscore the need for frameworks that can enhance model performance through iterative improvement.
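The additive baseline described above is simple enough to write in a few lines: the predicted log fold change for a double perturbation is the gene-wise sum of the two single-perturbation log fold changes. The values below are invented for illustration, and the L2 distance is used as one possible error measure.

```python
import math


def additive_baseline(lfc_a, lfc_b):
    """Predict double-perturbation log fold changes as the sum of the single effects."""
    shared = sorted(set(lfc_a) & set(lfc_b))  # only genes measured in both experiments
    return {g: lfc_a[g] + lfc_b[g] for g in shared}


def l2_distance(predicted, observed):
    """Euclidean distance between predicted and observed profiles over shared genes."""
    shared = set(predicted) & set(observed)
    return math.sqrt(sum((predicted[g] - observed[g]) ** 2 for g in shared))


# Made-up log fold changes for three genes under single perturbations A and B.
lfc_a = {"G1": 1.0, "G2": -0.5, "G3": 0.0}
lfc_b = {"G1": 0.5, "G2": 0.25, "G3": -1.0}
# Made-up observed log fold changes for the A+B double perturbation.
observed_ab = {"G1": 1.5, "G2": 0.0, "G3": -1.0}

predicted_ab = additive_baseline(lfc_a, lfc_b)
error = l2_distance(predicted_ab, observed_ab)
print(predicted_ab, error)
```

Any deviation between the additive prediction and the observation reflects a genetic interaction, so a model only adds value over this baseline if it predicts those deviations.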
The closed-loop framework introduces an iterative feedback mechanism wherein experimental perturbation data is incorporated into model fine-tuning, creating a cycle of continuous improvement between in silico predictions and experimental validation [9]. This approach fundamentally transforms scFMs from static prediction tools into adaptive learning systems that become increasingly accurate with each experimental cycle.
Table 1: Key Performance Improvements with Closed-Loop Framework in T-cell Activation Model
| Metric | Open-Loop ISP | Closed-Loop ISP | Relative Improvement |
|---|---|---|---|
| Positive Predictive Value (PPV) | 3% | 9% | 3-fold increase |
| Negative Predictive Value (NPV) | 98% | 99% | 1% increase |
| Sensitivity | 48% | 76% | 58% increase |
| Specificity | 60% | 81% | 35% increase |
| AUROC | 0.63 | 0.86 | 36% increase |
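The improvement figures in the table are relative changes with respect to the open-loop value rather than percentage-point differences. The short calculation below reproduces them from the rounded table entries; with these rounded inputs the AUROC change computes to about 36.5%.

```python
# Rounded values copied from the table above.
open_loop = {"PPV": 0.03, "NPV": 0.98, "Sensitivity": 0.48, "Specificity": 0.60, "AUROC": 0.63}
closed_loop = {"PPV": 0.09, "NPV": 0.99, "Sensitivity": 0.76, "Specificity": 0.81, "AUROC": 0.86}


def fold_change(before, after):
    """Ratio of the closed-loop value to the open-loop value."""
    return after / before


def relative_change(before, after):
    """Change expressed as a fraction of the open-loop value."""
    return (after - before) / before


summary = {
    name: (fold_change(open_loop[name], closed_loop[name]),
           relative_change(open_loop[name], closed_loop[name]))
    for name in open_loop
}
for name, (fold, rel) in summary.items():
    print(f"{name}: {fold:.2f}-fold, {rel * 100:.1f}% relative change")
```

For comparison, the sensitivity gain is 28 percentage points in absolute terms, which corresponds to the 58% relative increase shown in the table.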
In Silico Perturbation:
Experimental Validation:
Perturbation Data Incorporation:
Iterative Refinement:
Table 2: Minimum Perturbation Examples Required for Performance Improvement
| Number of Examples | Sensitivity | Specificity | Performance Level |
|---|---|---|---|
| 10 examples | 61% (95% CI: 58-64%) | 66% (95% CI: 62-70%) | Substantial improvement |
| 20 examples | 76% (95% CI: 72-78%) | 79% (95% CI: 75-83%) | Performance saturation |
| >20 examples | No significant improvement | No significant improvement | Diminishing returns |
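The four stages named above (in silico perturbation, experimental validation, perturbation data incorporation, iterative refinement) can be arranged as a simple control loop. The skeleton below is a hypothetical sketch: `predict`, `run_experiment`, and `fine_tune` are placeholder callables supplied by the user, and the toy stand-ins only memorize labels so that the loop can run end to end. Real fine-tuning would update model weights instead.

```python
def closed_loop(candidates, model, predict, run_experiment, fine_tune, rounds, batch_size):
    """Alternate between prediction, experimental testing, and fine-tuning."""
    labelled = {}
    for _ in range(rounds):
        untested = [g for g in candidates if g not in labelled]
        if not untested:
            break
        # 1. In silico perturbation: rank untested genes by predicted effect.
        ranked = sorted(untested, key=lambda g: predict(model, g), reverse=True)
        # 2. Experimental validation of the top-ranked predictions.
        for gene in ranked[:batch_size]:
            labelled[gene] = run_experiment(gene)
        # 3. Incorporate the new perturbation data and refine the model.
        model = fine_tune(model, labelled)
    return model, labelled


# --- Toy stand-ins (assumptions for illustration only) ---
truth = {"A": True, "B": False, "C": True, "D": False, "E": False}  # made-up ground truth
initial_scores = {"A": 0.2, "B": 0.9, "C": 0.4, "D": 0.8, "E": 0.1}  # made-up open-loop scores


def toy_predict(model, gene):
    return model.get(gene, 0.5)


def toy_experiment(gene):
    return truth[gene]


def toy_fine_tune(model, labelled):
    updated = dict(model)
    for gene, is_hit in labelled.items():
        updated[gene] = 1.0 if is_hit else 0.0  # memorize the measured outcome
    return updated


final_model, labelled = closed_loop(
    candidates=list(truth), model=initial_scores, predict=toy_predict,
    run_experiment=toy_experiment, fine_tune=toy_fine_tune, rounds=2, batch_size=2,
)
print(labelled)
print(final_model)
```

Given the saturation reported in the table, a practical loop would likely stop once roughly 20 labelled perturbation examples have been collected rather than run for a fixed number of rounds.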
RUNX1-familial platelet disorder (RUNX1-FPD) is a rare pediatric hematologic disease affecting approximately 20,000 people in the US, characterized by thrombocytopenia, impaired platelet function, and increased risk of early-onset myeloid neoplasms [9]. Currently, no interventions exist to prevent progression to myeloid malignancies [9].
Model System Development:
Model Fine-tuning:
Therapeutic Target Identification:
Table 3: Essential Research Reagents for Closed-Loop Framework Implementation
| Reagent/Category | Specific Examples | Function/Application |
|---|---|---|
| Single-cell Foundation Models | Geneformer-30M-12L, scGPT, scFoundation | Base models for fine-tuning and perturbation prediction [9] [11] |
| Genetic Perturbation Systems | CRISPRi, CRISPRa, Perturb-seq | Experimental generation of perturbation data for model training [9] |
| Validation Assays | Flow cytometry (IL-2, IFN-γ production), scRNA-seq | Orthogonal validation of in silico predictions [9] |
| Cell Model Systems | Primary human T cells, RUNX1-engineered HSCs | Biological contexts for model development and testing [9] |
| Computational Frameworks | PertEval-scFM | Benchmarking and evaluation of perturbation predictions [16] |
The implementation of the closed-loop framework in T-cell activation models demonstrated a three-fold increase in positive predictive value (from 3% to 9%) with concurrent improvements in negative predictive value (99%), sensitivity (76%), and specificity (81%) compared to open-loop approaches [9]. The area under the receiver operating characteristic curve (AUROC) significantly improved from 0.63 (95% CI: 0.58-0.68) to 0.86 (95% CI: 0.83-0.89) [9].
Application to RUNX1-FPD identified novel therapeutic targets and pathways, including:
From these targets, eight genes with available specific small molecule inhibitors were selected for experimental validation, including PRKCB and UBB [9]. This demonstrates the framework's potential for accelerating rare disease drug discovery by prioritizing the most promising therapeutic targets for experimental validation.
The closed-loop framework for integrating experimental perturbation data into scFM fine-tuning represents a significant advancement in in silico perturbation modeling. By creating an iterative feedback loop between computational predictions and experimental validation, this approach substantially improves prediction accuracy and biological relevance. The methodology detailed in this protocol provides researchers with a standardized approach for implementing this framework across diverse biological contexts and disease models.
Future development should focus on expanding the framework to incorporate diverse data modalities, improving model architectures specifically for perturbation prediction, and addressing current limitations identified in benchmarking studies [11] [16]. As these frameworks mature, they hold tremendous promise for accelerating therapeutic discovery, particularly for rare diseases where conventional screening approaches are impractical.
RUNX1-Familial Platelet Disorder (RUNX1-FPD) is a rare autosomal dominant inherited condition characterized by thrombocytopenia, impaired platelet function, and a pronounced predisposition to develop myeloid malignancies, most commonly myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML) [34]. The disease is caused by germline loss-of-function mutations in the RUNX1 gene, a crucial transcription factor in hematopoiesis. The estimated risk of progressing to a myeloid malignancy is approximately 40%, with a median age of onset of 33 years, though cases have been reported from age 2 to 72 [34] [35]. Affecting over 18,000 people in the United States, RUNX1-FPD presents significant clinical challenges due to the scarcity of patient samples and the lack of interventions to prevent leukemic transformation [9].
The clinical presentation is marked by significant phenotypic heterogeneity, even among family members carrying the identical RUNX1 mutation. A documented case study illustrates this variability: a 5-year-old boy presented with isolated thrombocytopenia, his mother developed MDS at 27 years, while his maternal grandfather remained asymptomatic with a normal platelet count at 60 years of age [34]. This heterogeneity complicates clinical prognosis and underscores the need for personalized therapeutic strategies. The molecular pathogenesis often involves subsequent somatic mutations in genes such as BCOR, PTPN11, KRAS, and TET2, which likely contribute to disease progression [34].
In Silico Perturbation (ISP) with single-cell foundation models (scFMs) represents a paradigm shift in biomedical research. scFMs are large-scale deep learning models, typically based on Transformer architectures, pre-trained on vast single-cell RNA sequencing (scRNA-seq) datasets. They learn the fundamental "language" of cells, where individual cells are treated as sentences and genes or genomic features as words [1] [36]. A key application is ISP, which simulates cellular responses to genetic perturbations (e.g., gene knockouts or overexpression) computationally, acting as a "virtual cell" platform [9]. This is particularly valuable for rare diseases like RUNX1-FPD, where experimental screening with patient samples is severely limited.
The standard open-loop ISP approach involves fine-tuning an scFM, such as Geneformer, on a target cellular state (e.g., RUNX1-knockout Hematopoietic Stem Cells (HSCs) vs. controls) and then predicting genes that, when perturbed, shift the diseased state toward a healthy one [9]. However, this open-loop paradigm has a critical limitation: its predictions are made in a vacuum, without the ability to learn from subsequent experimental validation.
The closed-loop ISP framework introduces a crucial iterative feedback mechanism. After the initial ISP predictions are generated, they are experimentally tested. The scRNA-seq data from these experimental perturbations are then incorporated back into the model during a subsequent fine-tuning round. This "closes the loop," allowing the model to learn from empirical data and refine its predictive capabilities [9]. The entire workflow, from model setup to therapeutic discovery, is outlined below.
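In outline, the loop alternates between ranking candidates in silico and testing the top-ranked ones experimentally. The Python sketch below captures only this control flow with toy stand-ins: `toy_predict` replaces fine-tuning plus ISP with a pathway-based score adjustment, `toy_experiment` replaces a Perturb-seq readout with a lookup, and all gene names, pathways, and scores are hypothetical. It is not the implementation used in [9].

```python
from typing import Callable, Dict, List


def closed_loop_isp(
    candidates: List[str],
    predict: Callable[[Dict[str, bool], List[str]], Dict[str, float]],
    run_experiment: Callable[[List[str]], Dict[str, bool]],
    rounds: int = 2,
    batch_size: int = 2,
) -> Dict[str, bool]:
    """Alternate between in silico ranking and experimental testing.

    `predict` stands in for fine-tuning plus ISP: it receives every label
    collected so far and returns a score for each untested gene.
    """
    labels: Dict[str, bool] = {}
    for _ in range(rounds):
        untested = [g for g in candidates if g not in labels]
        if not untested:
            break
        scores = predict(labels, untested)
        # Test the highest-scoring untested candidates in this round.
        batch = sorted(untested, key=lambda g: scores[g], reverse=True)[:batch_size]
        labels.update(run_experiment(batch))  # feed results back into the loop
    return labels


# --- Toy stand-ins (hypothetical genes, pathways, and scores) ---
PATHWAY = {"A1": "p1", "A2": "p1", "A3": "p1", "B1": "p2", "B2": "p2", "C1": "p3"}
TRUE_HITS = {"A1", "A2", "A3"}
PRIOR = {"A1": 0.9, "B1": 0.8, "B2": 0.7, "C1": 0.6, "A2": 0.2, "A3": 0.1}


def toy_predict(labels: Dict[str, bool], untested: List[str]) -> Dict[str, float]:
    """Raise scores in pathways with validated hits, lower them after misses."""
    hit_pathways = {PATHWAY[g] for g, hit in labels.items() if hit}
    miss_pathways = {PATHWAY[g] for g, hit in labels.items() if not hit}
    scores = {}
    for gene in untested:
        score = PRIOR[gene]
        if PATHWAY[gene] in hit_pathways:
            score += 1.0
        if PATHWAY[gene] in miss_pathways:
            score -= 0.5
        scores[gene] = score
    return scores


def toy_experiment(genes: List[str]) -> Dict[str, bool]:
    """Pretend to perturb each gene and report whether it was a true hit."""
    return {g: g in TRUE_HITS for g in genes}


labels = closed_loop_isp(list(PRIOR), toy_predict, toy_experiment)
print(labels)
```

In this toy setting the open-loop ranking alone would have tested A1, B1, B2, and C1 and found one hit, whereas feeding the first round's results back in redirects the second round to A2 and A3.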
The implementation of the closed-loop framework demonstrates a substantial quantitative improvement over traditional open-loop ISP. In the context of T-cell activation, a model system for benchmarking, the incorporation of even a small number of experimental perturbation examples during fine-tuning dramatically enhanced predictive performance [9].
Table 1: Performance Comparison of Open-Loop vs. Closed-Loop ISP for T-cell Activation (based on [9])
| Metric | Open-Loop ISP | Closed-Loop ISP | Relative Improvement |
|---|---|---|---|
| Positive Predictive Value (PPV) | 3% | 9% | 3-fold increase |
| Negative Predictive Value (NPV) | 98% | 99% | Marginal improvement |
| Sensitivity | 48% | 76% | 1.6-fold increase |
| Specificity | 60% | 81% | 1.35-fold increase |
| AUROC | 0.63 | 0.86 | ~37% increase |
A critical finding was the data efficiency of the closed-loop approach. Performance metrics improved dramatically with just 10 perturbation examples (Sensitivity: 61%, Specificity: 66%) and began to saturate after incorporating approximately 20 examples (Sensitivity: 76%, Specificity: 79%). This indicates that even a modest number of experimental validations can substantially enhance model accuracy, making the approach feasible for research on rare diseases where data is scarce [9].
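The four classification metrics reported above all derive from a confusion matrix over candidate genes. The following Python sketch shows the definitions; the counts are hypothetical and were chosen only to illustrate how a screen with few true hits can combine a high NPV with a low PPV. They are not data from [9], and the function name `screen_metrics` is an assumption made for this example.

```python
def screen_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return PPV, NPV, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "ppv": tp / (tp + fp),          # fraction of predicted hits that are real
        "npv": tn / (tn + fn),          # fraction of predicted non-hits that are real non-hits
        "sensitivity": tp / (tp + fn),  # fraction of real hits that were predicted
        "specificity": tn / (tn + fp),  # fraction of real non-hits that were rejected
    }


# Hypothetical screen: 1,000 candidate genes, of which 25 are true hits.
metrics = screen_metrics(tp=19, fp=185, tn=790, fn=6)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With a hit rate of only 2.5%, even a specificity above 80% leaves many more false positives than true positives, which is why PPV can stay in the single digits while NPV approaches 99%.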
Applying the closed-loop ISP framework to RUNX1-FPD, researchers began by fine-tuning the Geneformer model on human HSCs engineered with RUNX1 loss-of-function mutations, which showed high concordance with patient-derived HSCs [9]. The model was tasked with identifying genes whose deletion would shift the RUNX1-knockout HSCs toward a control-like state.
The initial open-loop ISP, combined with differential expression (DE) analysis, identified 14 high-confidence candidate genes predicted by both methods. From this list, eight genes with available specific small molecule inhibitors were selected for further investigation [9]. The closed-loop process helped prioritize the most promising therapeutic targets and pathways.
Table 2: Therapeutic Targets and Pathways Identified via Closed-Loop ISP for RUNX1-FPD (based on [9])
| Category | Target/Pathway | Potential Therapeutic Agent | Proposed Mechanism |
|---|---|---|---|
| Primary Targets | mTOR signaling | mTOR inhibitors (e.g., Rapamycin) | Corrects dysregulated protein synthesis and cell growth in RUNX1-deficient HSCs. |
| Primary Targets | CD74-MIF signaling axis | MIF inhibitors | Modulates inflammatory signaling implicated in the disease phenotype. |
| Novel Pathways | Protein Kinase C (PKC) | PKC inhibitors | Targets dysregulated intracellular signal transduction. |
| Novel Pathways | Phosphoinositide 3-Kinase (PI3K) | PI3K inhibitors | Acts on a key signaling pathway downstream of multiple receptors. |
The following diagram illustrates the signaling pathways identified as potential therapeutic targets for RUNX1-FPD, highlighting the points of intervention for small molecule inhibitors.
This protocol describes the initial fine-tuning of a pre-trained single-cell foundation model to establish a baseline for in silico perturbation predictions in RUNX1-FPD.
Materials:
Procedure:
This protocol details the iterative process of generating ISP predictions, experimentally testing them, and refining the model.
Materials:
Procedure:
Experimental Perturbation:
Closed-loop Fine-tuning:
Refined (Closed-loop) ISP:
Table 3: Essential Research Reagents and Computational Tools for Closed-Loop ISP
| Category / Item | Specific Example(s) | Function and Application |
|---|---|---|
| Single-Cell Foundation Models | Geneformer-30M-12L, scGPT, scFoundation [37] | Pre-trained models providing the base for fine-tuning and ISP tasks on single-cell data. |
| Computational Framework | Closed-loop ISP custom code (PyTorch) [9] | Software environment for model fine-tuning, running in silico perturbations, and integrating new data. |
| RUNX1-FPD Cell Model | Human HSCs with RUNX1 loss-of-function (CRISPR/Cas9) [9] | Biologically relevant in vitro system to model the disease and validate predictions. |
| Perturbation Screening Tool | CRISPR activation/interference (CRISPRa/i) with Perturb-seq [9] | Technology for experimentally perturbing candidate genes and measuring genome-wide effects at single-cell resolution. |
| Key Therapeutic Inhibitors | mTOR inhibitors, MIF inhibitors, PKC inhibitors, PI3K inhibitors [9] | Small molecules used for functional validation of predicted therapeutic targets in vitro and in vivo. |
The application of the closed-loop ISP framework to RUNX1-Familial Platelet Disorder represents a significant advancement in computational biology and rare disease research. By iteratively refining a single-cell foundation model with empirical data from targeted perturbations, this approach transforms the "virtual cell" from a static predictor into a dynamic, learning system. The method successfully identified several high-priority therapeutic targets, including the mTOR and CD74-MIF signaling axes, demonstrating the potential of AI-driven in silico discovery to accelerate the development of much-needed interventions for patients with this high-risk predisposition syndrome. This closed-loop paradigm is broadly applicable to a wide range of other genetic diseases, heralding a new era where computational models and experimental biology are tightly integrated to decipher and treat complex medical conditions.
The accurate in silico prediction of combinatorial genetic perturbation effects represents a cornerstone for advancing functional genomics and therapeutic discovery. Within the broader thesis of in silico perturbation modeling using single-cell Foundation Models (scFMs), this application note details the current computational landscape, performance benchmarks, and standardized protocols for modeling these complex biological interactions. The ability to simulate genetic interactions and synergistic drug effects enables researchers to prioritize experimental work, elucidate functional genetic networks, and identify novel therapeutic combinations with reduced experimental burden.
Recent benchmarking studies reveal a critical insight: despite their architectural complexity, many deep-learning foundation models do not consistently outperform deliberately simple linear baselines in predicting transcriptome-wide perturbation outcomes [11]. The field is rapidly evolving, with new architectures like the Large Perturbation Model (LPM) showing promise by explicitly disentangling Perturbation, Readout, and Context (PRC) dimensions, thereby enabling the integration of heterogeneous experimental data across diverse perturbations (e.g., CRISPR, chemical), readouts (e.g., transcriptomics, viability), and biological contexts [4].
Table 1 summarizes quantitative performance comparisons across key methodologies. Performance is typically measured using the Pearson correlation between predicted and observed gene expression values for held-out perturbations.
Table 1: Benchmarking of Perturbation Prediction Models
| Model | Model Type | Key Innovation | Reported Performance (Pearson r) | Data Modalities Supported |
|---|---|---|---|---|
| Large Perturbation Model (LPM) [4] | PRC-Disentangled Deep Learning | Disentangles Perturbation, Readout, Context | State-of-the-art (exact values not provided) | Genetic & Chemical; Transcriptomics & Viability |
| GPerturb [38] | Gaussian Process Regression | Sparse, interpretable effects with uncertainty estimates | 0.981 (Replogle), 0.979 (Norman) | Single-cell CRISPR screens (count & continuous data) |
| CPA [11] | Autoencoder | Predicts combinatorial & dose-dependent effects | Outperformed by linear baselines in double perturbation [11] | Continuous expression, dosages |
| GEARS [11] | Graph-Enhanced Deep Learning | Incorporates Gene Ontology knowledge graphs | Outperformed by linear baselines [11] | Discrete genetic perturbations |
| scGPT / Geneformer [11] | Single-cell Foundation Models | Transformer-based pretrained on scRNA-seq data | Did not outperform simple additive baseline [11] | Transcriptomics |
| Additive Baseline [11] | Simple Linear Model | Sum of individual logarithmic fold changes (LFCs) | Benchmark for double perturbations [11] | Gene expression |
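
For reference, the Pearson correlation used as the headline metric can be computed as follows. This is a minimal Python sketch with made-up expression values; it also compares the correlation on absolute expression with the correlation on changes relative to control, since the two can differ sharply when baseline expression dominates. The helper name `pearson` and all numbers are hypothetical and are not taken from the cited benchmarks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy expression values for four genes (hypothetical).
control = [10.0, 5.0, 1.0, 0.5]
observed = [11.0, 5.0, 0.5, 0.5]   # measured after the perturbation
predicted = [9.5, 5.0, 1.5, 0.5]   # model output with the wrong direction of change

r_expression = pearson(predicted, observed)
r_delta = pearson(
    [p - c for p, c in zip(predicted, control)],
    [o - c for o, c in zip(observed, control)],
)
print(f"r on expression: {r_expression:.3f}")
print(f"r on change vs. control: {r_delta:.3f}")
```

Here the prediction gets the direction of every change wrong, yet the correlation on absolute expression remains close to 1, so it is worth checking which quantity a reported Pearson r refers to.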
A significant challenge in the field is the prediction of genetic interactions, where the effect of a double perturbation deviates from the expected combination of single effects. In a benchmark using data from Norman et al., which included 124 double gene perturbations in K562 cells, models like GEARS, scGPT, and scFoundation were unable to outperform a simplistic "no change" or "additive" baseline in identifying these interactions [11]. Furthermore, most models demonstrated a strong bias towards predicting "buffering" interactions and were notably poor at identifying the rarer "synergistic" interactions correctly [11].
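The two baselines referred to above are simple enough to write in a few lines. The following Python sketch, using hypothetical log fold changes (LFCs) for three genes, builds the additive and "no change" predictions for a double perturbation and scores them by L2 distance; the residual between the observed and additive profiles is one simple way to quantify an interaction. Function names and values are illustrative assumptions, not the code or data of [11].

```python
import math


def additive_baseline(lfc_a, lfc_b):
    """Predict the double-perturbation LFC as the sum of the single LFCs."""
    return [a + b for a, b in zip(lfc_a, lfc_b)]


def no_change_baseline(n_genes):
    """Predict control expression, i.e. an LFC of zero for every gene."""
    return [0.0] * n_genes


def l2_distance(pred, obs):
    """Euclidean distance between predicted and observed profiles."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)))


# Hypothetical LFCs versus control for three genes.
lfc_a = [1.0, 0.0, -0.5]           # single perturbation A
lfc_b = [0.5, 0.0, 0.5]            # single perturbation B
lfc_ab_observed = [1.5, 0.0, 1.0]  # measured double perturbation A+B

additive = additive_baseline(lfc_a, lfc_b)
no_change = no_change_baseline(len(lfc_ab_observed))
# Deviation from additivity: nonzero entries suggest a genetic interaction.
interaction = [o - p for o, p in zip(lfc_ab_observed, additive)]

print("additive error:", l2_distance(additive, lfc_ab_observed))
print("no-change error:", l2_distance(no_change, lfc_ab_observed))
print("deviation from additivity:", interaction)
```

Any new model can be scored with the same `l2_distance` call, which makes it easy to check whether it actually improves on these two reference predictions.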
Table 2: Essential Research Reagents and Computational Tools
| Reagent / Tool | Function / Description | Application in Perturbation Modeling |
|---|---|---|
| CROP-seq / Perturb-seq [38] | Single-cell RNA-seq technology coupling CRISPR perturbations with transcriptomic readouts. | Generates high-throughput training and validation data for models. |
| LINCS Datasets [4] | Library of Integrated Network-Based Cellular Signatures; contains genetic and pharmacological perturbation data. | Used for training cross-modal models like LPM. |
| PhenotypeGenetics Software [39] | Open-source, cross-platform software for deriving genetic-interaction networks from quantitative phenotype data. | Computationally assigns interaction modes from phenotype inequalities. |
| Gene Ontology (GO) Annotations [11] | Structured, controlled vocabularies of gene and gene product attributes. | Used by models like GEARS to inform gene relationships for predicting unseen perturbations. |
| DrugComboRanker / AuDNNsynergy [40] | AI-based algorithms for predicting synergistic and antagonistic drug combinations. | Applied in multi-omics drug discovery for anti-cancer and antimicrobial therapy optimization. |
Purpose: To objectively evaluate the performance of a new perturbation prediction model against established baselines using a standardized dataset of single and double genetic perturbations.
Background: This protocol is adapted from benchmarks performed in [11], which highlighted the critical importance of comparing against simple baselines.
Materials:
Procedure:
Experimental Setup:
Model Training and Fine-tuning:
Performance Evaluation:
Troubleshooting:
Purpose: To train an LPM for multi-task biological discovery, including predicting effects of unseen perturbations and mapping shared mechanisms between chemical and genetic perturbations.
Background: The LPM architecture integrates heterogeneous data by treating Perturbation (P), Readout (R), and Context (C) as separate, disentangled conditioning variables [4].
Materials:
Procedure:
Model Training:
Model Application for Discovery:
Purpose: To systematically classify genetic interaction modes from quantitative phenotype data using the PhenotypeGenetics framework.
Background: This classical, computation-based method defines genetic interactions through inequalities between the phenotypes of wild-type, single-mutant, and double-mutant genotypes [39]. It provides a model-agnostic way to establish ground truth for interactions.
Materials:
Procedure:
Classification of Interaction Modes:
Network Construction and Analysis:
The drug discovery landscape for rare diseases is fraught with challenges, including small patient populations, limited access to biological samples, and often poorly understood pathophysiology [41]. In silico technologies, particularly single-cell foundation models (scFMs), are emerging as powerful tools to overcome these barriers by enabling the prediction of cellular responses to genetic and chemical perturbations. These "virtual cell" models provide a scalable, human-relevant platform for identifying and prioritizing therapeutic targets, especially where experimental screening with scarce patient samples is unfeasible [9]. This Application Note details protocols for leveraging in silico perturbation modeling to accelerate target identification and validation for rare diseases, providing a structured framework for researchers and drug development professionals.
Computational approaches are being deployed across the rare disease research and development continuum. The table below summarizes the key contexts of use (CoUs) for in silico technologies, highlighting their specific applications and the methodologies employed.
Table 1: Contexts of Use for In Silico Technologies in Rare Disease Research
| Context of Use (CoU) | Primary Applications | Representative Methodologies |
|---|---|---|
| Diagnosis & Disease Characterization (CoU1) | Variant interpretation, phenotype mining, disease stratification [41] | AI-enhanced genomic pipelines (e.g., popEVE), NLP-EHR analysis, deep learning for pathogenicity prediction [41] [42] |
| Drug Discovery (CoU2) | Target identification/prioritization, virtual screening, drug repurposing [41] | Network pharmacology, AI-led target ID (e.g., PandaOmics), molecular docking, QSAR modeling [41] [43] |
| Preclinical Development (CoU3) | Disease mechanism modeling, biomarker nomination, efficacy prediction [41] | Single-cell Foundation Models (scFMs), Quantitative Systems Pharmacology (QSP), organoid-ML simulations [41] [9] |
| Clinical Trial Design (CoU4) | Virtual trials, synthetic control arms, pharmacokinetic/pharmacodynamic (PK/PD) modeling [41] | Pharmacometric models, PBPK, virtual patient cohort simulation [41] |
Single-cell foundation models (scFMs), such as Geneformer, are deep learning models pre-trained on vast amounts of single-cell transcriptomics data [9]. They can be fine-tuned for specific tasks, including in silico perturbation (ISP), which predicts how a genetic perturbation (e.g., gene knockout or overexpression) would alter a cell's transcriptomic state [9]. A critical advancement is the "closed-loop" framework, where the model iteratively incorporates experimental perturbation data during fine-tuning to significantly improve prediction accuracy [9].
This protocol outlines the steps for fine-tuning an scFM and implementing a closed-loop ISP to identify therapeutic targets for a rare disease.
I. Model Fine-Tuning for Disease State Classification
II. Open-Loop In Silico Perturbation Screening
III. Closing the Loop: Model Enhancement with Experimental Data
The following workflow diagram illustrates this closed-loop experimental protocol:
While scFMs hold immense promise, a critical appraisal of their performance against simpler models is essential for robust experimental design. Recent benchmarking studies reveal that the performance of complex deep-learning models for predicting perturbation effects is highly context-dependent.
Table 2: Model Performance Comparison for Perturbation Prediction
| Model / Baseline | Reported Performance | Context and Notes |
|---|---|---|
| Closed-loop scFM (Geneformer) | 3x increase in Positive Predictive Value (PPV) vs. open-loop (from 3% to 9%); High NPV (99%), Sensitivity (76%), Specificity (81%) [9] | Applied to T-cell activation and RUNX1-FPD; Performance improved with just ~20 perturbation examples [9]. |
| Open-loop scFM (Geneformer) | PPV: 3%; Negative Predictive Value (NPV): 98%; Sensitivity: 48%; Specificity: 60% [9] | Outperformed differential expression (DE) analysis for NPV, sensitivity, and specificity [9]. |
| 'Additive' Baseline Model | Lower prediction error (L2 distance) than 5 foundation models and 2 other deep learning models for predicting double perturbation effects [11] | Predicts double perturbation effects as the sum of individual logarithmic fold changes. Used no double perturbation data for training [11]. |
| 'No Change' Baseline Model | Performance equivalent or superior to deep learning models in predicting genetic interactions from double perturbations [11] | Always predicts the same expression as in the control condition. |
| Simple Linear Model | Outperformed or matched deep learning models in predicting effects of unseen single-gene perturbations [11] | Uses dimension-reducing embeddings of training data; performance can be enhanced with embeddings from foundation models [11]. |
The following table details key reagents and computational tools essential for conducting in silico perturbation studies for rare diseases.
Table 3: Essential Research Reagents and Tools for In Silico Perturbation Modeling
| Item / Resource | Function / Application | Example / Note |
|---|---|---|
| Pre-trained scFM | Provides a foundational understanding of gene-gene relationships from vast single-cell data; base for task-specific fine-tuning. | Geneformer [9], scGPT [11], scFoundation [11]. |
| Rare Disease Model scRNA-seq Data | Essential dataset for fine-tuning the scFM to recognize the specific disease pathophysiology. | Patient-derived cells or genetically engineered in vitro models (e.g., RUNX1-knockout HSCs) [9]. |
| Perturb-seq Data | Gold-standard experimental data used to "close the loop" and ground-truth model predictions, drastically improving accuracy. | CRISPR-based perturbation coupled with scRNA-seq [9]. |
| AI-Based Pathogenicity Predictor | Aids in initial variant prioritization and diagnosis (CoU1), helping to define the genetic basis of the rare disease. | popEVE model scores variants by disease likelihood [42]. |
| Network Analysis Platform | Identifies novel therapeutic targets and supports drug repurposing by analyzing interactions within biological systems. | PandaOmics for ALS [41], STRING, Cytoscape [41]. |
| Linear Model Baselines | Critical for benchmarking the performance of more complex deep learning models; ensures reported advances are meaningful. | 'Additive' and 'No Change' models, simple linear regression with embeddings [11]. |
Applying the closed-loop ISP framework to rare diseases like RUNX1-Familial Platelet Disorder (RUNX1-FPD) can identify key dysregulated signaling pathways. The model nominates specific genes within these pathways whose perturbation can shift the diseased state toward normal, highlighting them as potential therapeutic targets [9].
The diagram below summarizes the key signaling pathways and candidate therapeutic targets identified for RUNX1-FPD using this approach:
Rare diseases and research involving challenging primary patient samples present a major obstacle in biomedical research: the profound scarcity of biological material. This scarcity limits the application of traditional high-throughput screening methods for target discovery and therapeutic development. The emergence of in silico perturbation modeling, particularly using single-cell Foundation Models (scFMs), provides a powerful framework to overcome these limitations. These technologies enable the virtual simulation of cellular and molecular responses to genetic or chemical perturbations, dramatically reducing the experimental burden on precious samples [44]. This Application Note details protocols for employing these computational strategies to conduct virtual screens and derive biologically meaningful insights from limited datasets, thereby accelerating research for rare conditions and complex diseases.
Several advanced computational frameworks now enable the prediction of cellular responses to perturbations. The choice of model depends on the type of available data and the specific biological question. The core capability of these models is to learn the underlying "rules" of cellular biology from large-scale existing data and apply them to a specific, data-scarce context of interest.
The Large Perturbation Model (LPM) is a deep-learning architecture specifically designed to integrate heterogeneous perturbation experiments. Its key innovation is the disentanglement of the Perturbation (P), Readout (R), and experimental Context (C) into separate dimensions [45].
The following diagram illustrates the core architecture and workflow of an LPM for in silico discovery.
For research involving high-content imaging, the IMage Perturbation Autoencoder (IMPA) offers a solution for predicting morphological responses. IMPA is a generative style-transfer model that decomposes a cell image into a content component (the cell's identity) and a style component (the perturbation effect) [46].
The field is moving towards standardized benchmarking to accelerate development. Initiatives like the Arc Institute's Virtual Cell Challenge provide community-wide competitions to stress-test models on their ability to generalize to new cell contexts and predict the effects of single gene perturbations [47]. Furthermore, the concept of a "Virtual Cell" extends beyond prediction to include the explanation of underlying mechanisms and the discovery of novel biology, forming a Predict-Explain-Discover (P-E-D) cycle that is highly valuable for drug discovery [48].
Table 1: Comparison of In Silico Perturbation Modeling Frameworks
| Framework | Core Architecture | Input Data Modality | Primary Output | Key Advantage for Sample Scarcity |
|---|---|---|---|---|
| Large Perturbation Model (LPM) [45] | PRC-disentangled, decoder-only deep learning | Transcriptomics, Viability | Predicted post-perturbation readout (e.g., gene expression) | Integrates data from diverse contexts; predicts for unseen perturbations. |
| IMPA [46] | Conditional Generative Adversarial Network (GAN) | High-content microscopy images | Synthetic image of perturbed cell | Predicts morphological effects without needing paired before/after image data. |
| scGPT / Geneformer [45] | Transformer-based encoder | Single-cell transcriptomics | Cell and gene embeddings | Can be fine-tuned on small datasets for context-specific predictions. |
| VirtuDockDL [49] | Graph Neural Network (GNN) | Chemical structures (SMILES) | Predicted binding affinity / activity | Accelerates virtual screening of compound libraries against a protein target. |
This protocol is designed to identify potential drug candidates for a target of interest (e.g., a protein implicated in a rare disease) using a machine learning (ML)-based classifier, minimizing the need for wet-lab screening until the final stages [50] [51].
To mitigate bias, consider alternative decoy strategies such as using Dark Chemical Matter (DCM) or random selections from the ZINC15 database [52].
The workflow for this integrated computational screening process is summarized below.
This protocol uses a pre-trained LPM to identify existing drugs that might be effective against a rare disease cell type.
Specify the perturbation (P) as a library of approved drugs or a specific compound of interest.
Specify the readout (R) as a transcriptomic profile or cell viability measurement.
Specify the context (C) as the specific challenging cell type or patient-derived sample.
Table 2: Essential Computational Tools and Databases
| Tool / Database | Type | Primary Function in Protocol | Reference/Access |
|---|---|---|---|
| RDKit | Cheminformatics Software | Computes molecular descriptors and fingerprints from SMILES strings for ML model training. | [49] [50] |
| ChEMBL / BindingDB | Bioactivity Database | Source of known active compounds for a target; used to create labeled training data for ML models. | [50] [51] |
| Directory of Useful Decoys-Enhanced (DUD-E) | Decoy Compound Database | Provides physicochemically matched, presumed inactive molecules to balance ML training sets. | [50] [52] |
| ZINC15 | Database of Commercially Available Compounds | Source of purchasable, drug-like molecules for virtual screening and decoy selection. | [52] |
| scikit-learn | Machine Learning Library | Provides implementations of Random Forest, SVM, and other algorithms for building classifiers. | [50] [51] |
| PADIF Fingerprint | Protein-Ligand Interaction Descriptor | Used to train target-specific ML scoring functions to improve virtual screening power. | [52] |
| Arc Virtual Cell Atlas | Transcriptomics Data Repository | Large-scale single-cell dataset for pre-training or fine-tuning perturbation models. | [47] |
While in silico models significantly de-risk experimentation, their predictions require rigorous validation before conclusions are drawn.
A fundamental challenge in applying artificial intelligence to single-cell genomics lies in the non-sequential nature of gene expression data. Unlike natural language, where words follow a deterministic order, or images, where pixels have spatial relationships, the genes within a cell's transcriptome have no inherent sequence. This creates a significant obstacle for transformer-based architectures and other sequential models that require structured input. This Application Note outlines standardized protocols for processing, tokenizing, and analyzing non-sequential gene expression data within in silico perturbation modeling frameworks using single-cell Foundation Models (scFMs). The methodologies described herein enable researchers to transform unordered gene vectors into structured inputs suitable for advanced AI models, thereby facilitating more accurate predictions of cellular responses to genetic and chemical perturbations.
Table: Core Challenges of Non-Sequential Gene Expression Data
| Challenge | Description | Impact on Modeling |
|---|---|---|
| Lack of Native Ordering | Genes in expression vectors have no biological sequence. | Sequential models (e.g., transformers with positional encodings) cannot be applied directly; an ordering must first be imposed. |
| Dimensionality | Profiling typically measures 20,000+ genes per cell. | Computationally intensive; requires robust feature selection. |
| Batch Effects | Technical variations between experiments. | Introduces spurious correlations; hinders model generalization. |
Tokenization is the critical process of converting raw gene expression data into discrete units (tokens) that scFMs can process. Since genes lack a natural sequence, a deterministic ordering must be imposed. The following protocol details the primary strategies identified in the literature for this purpose [1] [5].
This protocol creates a cell-specific sequence by ranking genes based on their expression magnitude, which serves as a consistent and biologically informative ordering system.
Set the maximum sequence length L (e.g., 1200 genes) based on model requirements and computational constraints.
For cells with more than L detected genes, retain only the top L ranked genes.
For cells with fewer than L detected genes, pad the sequence with a special <PAD> token or mask.
The following diagram illustrates the workflow for the gene ranking tokenization strategy.
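As a rough illustration of the ranking, truncation, and padding steps, the following Python sketch tokenizes a single cell. It is a minimal sketch under stated assumptions, not the tokenizer of any particular scFM: the function name `rank_tokenize`, the tie-breaking rule (alphabetical by gene symbol), and the toy expression values are hypothetical, and any normalization applied before ranking in real pipelines is omitted.

```python
from typing import Dict, List


def rank_tokenize(
    expression: Dict[str, float], max_len: int, pad_token: str = "<PAD>"
) -> List[str]:
    """Turn one cell's expression values into a fixed-length gene sequence.

    Genes are ordered by decreasing expression, undetected genes are dropped,
    the sequence is truncated to `max_len`, and short sequences are padded.
    """
    detected = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by expression (descending); break ties by gene symbol so that the
    # ordering is deterministic (an assumption made for this sketch).
    detected.sort(key=lambda item: (-item[1], item[0]))
    tokens = [gene for gene, _ in detected[:max_len]]
    tokens += [pad_token] * (max_len - len(tokens))
    return tokens


# Toy cell with four measured genes, one of them undetected.
cell = {"GAPDH": 50.0, "CD3E": 12.0, "IL2": 0.0, "ACTB": 50.0}
print(rank_tokenize(cell, max_len=4))
print(rank_tokenize(cell, max_len=2))
```

Because the ordering depends only on the cell's own expression values, the same function can be applied independently to every cell in a dataset.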
To evaluate the efficacy of different tokenization and modeling approaches in handling non-sequential data for perturbation tasks, a robust benchmarking framework is essential. The following protocol utilizes the BioLLM framework to ensure standardized and reproducible comparisons [5].
This protocol outlines the steps for performing a comparative analysis of different scFMs on a standardized perturbation dataset.
Environment and Data Setup
Standardize metadata column names (e.g., rename adata.obs['label'] to adata.obs['condition']).
Model Initialization and Configuration
Select random_forest_classifier or a similar estimator as the perturbation predictor for tasks like cell type prioritization.
Feature Selection and Training
select_variance_features=True: Uses the original Augur variance-based feature selection.
scanpy.pp.highly_variable_genes: Uses Scanpy's highly variable gene selection, which is faster but can inflate the resulting scores.
Performance Evaluation
Table: Benchmarking Results of scFMs on Exemplar Tasks (Adapted from BioLLM [5])
| Model | Zero-shot Embedding Quality (ASW) | Perturbation Prediction (AUC) | Computational Efficiency |
|---|---|---|---|
| scGPT | 0.85 (Consistently highest) | 0.92 | High (Optimized memory/time) |
| Geneformer | 0.78 (Strong on gene-level tasks) | 0.87 | High |
| scFoundation | 0.75 | 0.84 | Moderate |
| scBERT | 0.65 (Lags behind peers) | 0.79 | Low |
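
AUC-type scores such as those in the table above measure how well a model's scores separate two groups of cells or genes. A minimal Python sketch of the rank-based (Mann-Whitney) formulation is shown below; the function name `auroc` and the scores are hypothetical, and production code would normally rely on an established library implementation.

```python
from typing import Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the Mann-Whitney U statistic
    divided by the number of positive-negative pairs.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores for perturbed (positive) and control (negative) cells.
perturbed = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2]
print(f"AUROC: {auroc(perturbed, controls):.3f}")
```

A value of 0.5 corresponds to no separation between the two groups and 1.0 to perfect separation.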
Once a model has processed gene expression data into a structured format, it can be powerfully applied to predict the effects of perturbations. The following protocols detail this for two key tasks.
This protocol uses the Augur method to identify which cell types within a heterogeneous sample are most affected by a perturbation, based on the separability of their transcriptomic profiles [54].
- Prepare an annotated dataset with condition labels (e.g., control and stimulated) and cell type annotations.
- Instantiate the Augur object with the chosen estimator (e.g., ag_rfc = pt.tl.Augur("random_forest_classifier")).
- Load the data, specifying the label_col (perturbation condition) and cell_type_col.
- Run the predict function with parameters like subsample_size=20 (number of cells per type) and n_threads=4 for parallelization. Use select_variance_features=True for high-resolution results.
- Rank cell types by mean_augur_score (derived from AUC). Higher scores indicate cell types whose transcriptional state is more profoundly altered by the perturbation.
GPerturb is a Gaussian process-based model that estimates sparse, interpretable gene-level perturbation effects, providing uncertainty estimates for its predictions [55].
- GPerturb handles both count data (GPerturb-ZIP) and continuous transformed data (GPerturb-Gaussian).
Table: Key Computational Tools for scFM and Perturbation Modeling
| Tool Name | Type | Primary Function in Perturbation Modeling | Reference/Source |
|---|---|---|---|
| BioLLM | Software Framework | Unified interface for integrating and benchmarking multiple scFMs. | [5] |
| Pertpy | Python Toolkit | Provides perturbation analysis methods, including Augur. | [54] |
| scGPT | Foundation Model | Transformer-based scFM for cell and gene embedding; excels in multiple tasks. | [1] [5] |
| GPerturb | Perturbation Model | Gaussian process model for sparse, interpretable effect estimation with uncertainty. | [55] |
| CPA | Perturbation Model | Autoencoder to predict counterfactual expression under different perturbations. | [55] |
| CZ CELLxGENE | Data Catalog | Platform providing access to millions of curated single-cell datasets for pretraining. | [1] |
Effectively addressing the non-sequential nature of gene expression data is a cornerstone of modern computational biology. The tokenization strategies, benchmarking protocols, and specialized perturbation models detailed in this Application Note provide a robust and standardized pathway for researchers to leverage the full power of single-cell Foundation Models. By transforming unordered transcriptomic data into a structured format that AI models can interpret, we unlock the potential to perform high-fidelity in silico simulations of genetic and chemical perturbations. This capability is poised to dramatically accelerate therapeutic discovery and deepen our understanding of fundamental cellular processes.
In the field of in silico perturbation modeling with single-cell Foundation Models (scFMs), data quality is not merely a technical concern but a fundamental determinant of model reliability and biological insight. Batch effects, technical noise, and data inconsistencies represent significant challenges that can compromise the integrity of computational predictions. Batch effects are defined as unwanted technical variations introduced due to differences in laboratory conditions, instrumentation, reagent lots, or personnel [56] [57]. In the context of perturbation modeling, where the goal is to understand causal relationships by predicting system responses to interventions, these artifacts can create false predictions or obscure true biological signals [4] [56].
The integration of diverse, large-scale perturbation datasets is central to training robust large perturbation models (LPMs) and scFMs. These models learn to disentangle perturbation (P), readout (R), and context (C) dimensions to predict experimental outcomes [4]. However, this integration is critically dependent on data harmonization. Technical variations can severely hinder the model's ability to learn generalizable rules, leading to inaccurate predictions of post-perturbation cellular states and misidentification of molecular mechanisms [4] [56]. Therefore, implementing rigorous protocols for assessing and mitigating batch effects is a prerequisite for biologically meaningful in silico discovery.
Technical noise in single-cell and spatial transcriptomics arises from multiple sources throughout the experimental workflow. Major sources include variability in sample preparation protocols, differences in sequencing platforms and library preparation kits, reagent batch variations, and environmental conditions [56] [58]. In mass-spectrometry-based proteomics, the problem is further compounded by the multi-step data transformation process from spectra to peptides to proteins, creating multiple potential entry points for batch effects [59] [57].
The consequences of unaddressed batch effects are profound. They can:
The recently developed Large Perturbation Model (LPM) architecture demonstrates the critical importance of high-quality, harmonized data. LPMs integrate heterogeneous perturbation experiments by representing perturbation, readout, and context as disentangled dimensions [4]. This approach enables predicting outcomes of unobserved perturbation experiments and identifying shared molecular mechanisms across perturbation types. However, the model's performance is contingent on its ability to learn perturbation-response rules that are generalizable across contexts, a task severely hampered by unaddressed batch effects [4].
Effective batch effect correction begins with comprehensive assessment. Visual methods provide an intuitive first look at data structure and potential technical artifacts:
Beyond visual inspection, quantitative metrics provide objective measures of batch effect severity and correction efficacy:
Table 1: Quantitative Metrics for Assessing Batch Effects
| Metric | Description | Interpretation | Ideal Value |
|---|---|---|---|
| Average Silhouette Width (ASW) | Measures clustering tightness and separation | On cell type labels, higher values indicate better-preserved biological structure; on batch labels, lower values indicate better batch mixing | High for cell type, low for batch |
| Adjusted Rand Index (ARI) | Measures similarity between two clusterings | Higher values indicate better preservation of biological clusters after correction | Close to 1 |
| Local Inverse Simpson's Index (LISI) | Quantifies diversity of batches in local neighborhoods | Higher values indicate better batch mixing | High |
| kBET Acceptance Rate | Tests whether batch labels are random in local neighborhoods | Higher rates indicate successful batch mixing | Close to 1 |
These metrics should be applied both before and after correction to quantitatively evaluate the effectiveness of the chosen batch effect correction strategy [58].
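To make the neighborhood-diversity idea behind LISI concrete, the sketch below computes an inverse Simpson's index over each cell's nearest neighbors. It is a simplified stand-in, not the published LISI: it assumes a hard k-nearest-neighbor set with uniform weights, whereas the original method uses perplexity-based Gaussian weighting. The function names and the toy one-dimensional coordinates are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def inverse_simpson(labels: Sequence[str]) -> float:
    """Inverse Simpson's index: 1 / sum(p_b^2) over label frequencies p_b."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 / sum((c / total) ** 2 for c in counts.values())


def simple_lisi(points: List[Sequence[float]], batches: List[str], k: int) -> List[float]:
    """Per-cell batch diversity among the k nearest neighbors (self excluded).

    Simplification: uniform weights over a hard kNN set.
    """
    scores = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        scores.append(inverse_simpson([batches[j] for j in others[:k]]))
    return scores


# Toy embedding 1: two batches interleaved along a line (well mixed).
line = [(float(i),) for i in range(8)]
mixed_scores = simple_lisi(line, ["A", "B"] * 4, k=4)

# Toy embedding 2: two batches far apart (strong batch effect).
separated = [(float(i),) for i in range(4)] + [(100.0 + i,) for i in range(4)]
separated_scores = simple_lisi(separated, ["A"] * 4 + ["B"] * 4, k=3)

print("mixed mean:", sum(mixed_scores) / len(mixed_scores))
print("separated mean:", sum(separated_scores) / len(separated_scores))
```

With two batches the score ranges from 1 (all neighbors from one batch) to 2 (neighbors split evenly), so comparing the mean before and after correction gives a quick check of whether mixing improved.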
The most effective approach to batch effects is prevention through careful experimental design:
When batch effects cannot be prevented through design alone, computational correction is necessary. The workflow differs for various data types and analytical goals:
Diagram 1: BE Correction Workflow - This diagram outlines the decision process for implementing batch effect correction in omics data analysis, from initial assessment through validation for in silico modeling.
For mass-spectrometry-based proteomics data, recent benchmarking studies using the Quartet protein reference materials provide clear guidance:
Data Level Selection: Protein-level correction consistently demonstrates superior robustness compared to precursor or peptide-level correction across multiple quantification methods and batch effect correction algorithms [59].
Algorithm Selection: Test multiple algorithms as performance varies by context:
Performance Validation: Assess correction using:
Table 2: Batch Effect Correction Algorithms and Their Applications
| Algorithm | Mechanism | Best For | Considerations |
|---|---|---|---|
| ComBat | Empirical Bayes adjustment for known batches | Bulk data with defined batch structure | Requires known batch info; may not handle nonlinear effects [59] [58] |
| SVA | Estimates hidden sources of variation | When batch variables are unknown | Risk of removing biological signal; requires careful modeling [58] [57] |
| Harmony | Iterative clustering in reduced dimension space | Single-cell data integration | Preserves biological variation while aligning batches [59] [58] |
| Ratio-based | Sample intensity relative to reference standards | Multi-batch proteomics studies | Requires universal reference materials [59] |
| WaveICA2.0 | Multi-scale decomposition with injection order | MS-data with signal drift over time | Addresses continuous drift effects in large sample sets [59] [57] |
For single-cell RNA-seq data in perturbation studies:
Preprocessing: Quality control to remove low-quality cells (high mitochondrial percentage, low gene counts) [60]
Integration: Use Harmony or similar algorithms to align cells across batches while preserving biological heterogeneity [60] [58]
Validation: Apply both visual (UMAP) and quantitative (LISI, ASW) metrics to ensure batch mixing without biological signal loss [58]
Single-cell Foundation Models and Large Perturbation Models have specific data quality requirements that must be addressed through proper batch correction:
Disentangled Representations: LPMs explicitly disentangle perturbation, readout, and context dimensions. Batch effects can blur these distinctions, reducing model accuracy in predicting outcomes for unseen perturbations [4].
Cross-Platform Generalization: Effective perturbation models must generalize across experimental contexts. Batch effects that correlate with platform-specific factors hinder this capability [4].
Mechanistic Insight: Quality-controlled data enables LPMs to accurately associate genetic and chemical perturbations that share molecular mechanisms, as demonstrated by the clustering of mTOR inhibitors with genetic perturbations targeting MTOR in the learned embedding space [4].
Implement these quality control checkpoints before training or fine-tuning perturbation models:
Diagram 2: LPM Modeling Pipeline - This diagram shows how batch-corrected data feeds into Large Perturbation Model training and enables multiple biological discovery tasks that ultimately validate therapeutic hypotheses.
Table 3: Research Reagent Solutions for Quality-Assured Perturbation Modeling
| Resource | Type | Function | Example Use Case |
|---|---|---|---|
| Quartet Reference Materials | Biological standards | Multi-level quality control for proteomics | Assessing batch effect correction efficacy across labs [59] |
| SuPreMo Tool | Computational framework | In silico mutagenesis and sequence perturbation | Generating variant sequences for input to predictive models [61] |
| SingleR | Cell type annotation | Automated cell type identification | Ensuring consistent cell labeling across batches [60] |
| CellChat | Cell communication analysis | Inference of intercellular signaling networks | Studying how perturbations affect cell-cell communication [60] |
| InferCNV | Copy number variation analysis | Detection of CNVs from scRNA-seq data | Distinguishing malignant from non-malignant cells in tumor samples [60] |
| Harmony | Batch integration algorithm | Aligning datasets in reduced dimension space | Integrating single-cell data across multiple patients or conditions [60] |
Mitigating data quality issues is not a mere preprocessing step but a foundational requirement for robust in silico perturbation modeling. As single-cell Foundation Models and Large Perturbation Models continue to advance in sophistication and application scope, the integrity of their predictions will remain critically dependent on the quality of their training data. By implementing the systematic assessment protocols, correction strategies, and validation frameworks outlined in this document, researchers can significantly enhance the reliability of their computational models. This rigorous approach to data quality ensures that model predictions reflect genuine biology rather than technical artifacts, ultimately accelerating the discovery of novel therapeutic targets and biological mechanisms through more trustworthy in silico experimentation.
The application of single-cell foundation models (scFMs) has revolutionized our ability to interpret cellular heterogeneity and complex regulatory networks, positioning them as pivotal tools in computational biology and drug discovery [1]. These models, typically built on transformer architectures, are pretrained on vast datasets encompassing millions of single-cell transcriptomes to learn fundamental biological principles [1]. However, training and fine-tuning these large-scale deep learning models demands intensive computational resources, creating a significant bottleneck for their widespread adoption [1]. Effectively managing these resource demands is particularly crucial within the context of in silico perturbation (ISP) modeling, where researchers aim to create accurate "virtual cell" models that can simulate cellular responses to genetic and chemical perturbations without extensive wet-lab experimentation [9]. This application note provides a structured framework and practical protocols for optimizing computational efficiency when working with scFMs, enabling researchers to balance model performance with practical resource constraints.
Understanding the specific resource requirements of different scFMs is essential for project planning and infrastructure allocation. The computational intensity varies significantly across models based on their architecture, parameter count, and pretraining strategies.
Table 1: Computational Profiles of Prominent Single-Cell Foundation Models
| Model | Parameter Scale | Primary Architecture | Key Resource Intensifiers | Noted Efficiency Features |
|---|---|---|---|---|
| scGPT | Not Specified | GPT-based Decoder | Flash-attention blocks, random gene identity embeddings [5] | Superior memory usage and computational time efficiency [5] |
| Geneformer | 30M-12L to 106M-12L | Transformer | Model depth, attention mechanisms | Efficient cell embedding generation [5] [9] |
| scBERT | Smaller Scale | BERT-like Encoder | Bidirectional attention, gene2vec embeddings [5] | Higher memory consumption relative to performance [5] |
| scFoundation | Not Specified | Transformer | Pretraining corpus size, embedding dimensions | Moderate computational efficiency [5] |
Table 2: Impact of Input Dimensions on Computational Load
| Factor | Effect on Memory | Effect on Training Time | Performance Correlation |
|---|---|---|---|
| Input Gene Sequence Length | Linear increase with longer sequences [5] | Significant increase with longer sequences | scGPT improves with longer inputs; scBERT declines [5] |
| Batch Size | Proportional increase | Decreases with larger batches (to a point) | Optimal batch size varies by model architecture |
| Dataset Integration Complexity | Higher with cross-technology batches [5] | Extended processing for batch correction | Model-dependent: scGPT handles consistency better than cross-technology [5] |
Benchmarking studies reveal that model performance does not always correlate with computational footprint. In comprehensive evaluations, scGPT consistently demonstrated superior computational efficiency in terms of both memory usage and processing time, while scBERT showed declining performance with increasing input sequence length despite significant resource consumption [5]. This highlights the importance of selecting models based not only on reported accuracy but also on their computational characteristics for specific tasks.
The Low-Rank Adaptation (LoRA) technique has emerged as a transformative approach for optimizing computational workload during fine-tuning. LoRA operates on a mathematical insight that weight updates during adaptation have a low "intrinsic rank" and can be represented in a much lower-dimensional space [62].
Instead of updating all parameters in a weight matrix W (with dimensions d×k), LoRA freezes the pre-trained weights and injects trainable rank decomposition matrices. The modified forward pass is represented as: h = W₀x + ΔWx = W₀x + BAx where A ∈ R^{r×k} and B ∈ R^{d×r} are the trainable adaptation matrices, and the rank r ≪ min(d,k) [62].
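The forward pass above can be illustrated with a small NumPy sketch. The dimensions are arbitrary toy values, and the initialization (small random A, zero B) follows the convention described in the LoRA paper; nothing here is tied to a specific scFM or fine-tuning library. Note that practical implementations usually also scale the update by alpha/r, which is omitted here to match the formula as written.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4  # illustrative sizes with r << min(d, k)

W0 = rng.normal(size=(d, k))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))   # trainable, small random init
B = np.zeros((d, r))                      # trainable, zero init so BA = 0 at the start


def lora_forward(x: np.ndarray) -> np.ndarray:
    """h = W0 x + B A x, computed without materializing the d x k update."""
    return W0 @ x + B @ (A @ x)


x = rng.normal(size=k)
# Because B starts at zero, the adapted layer initially matches the frozen layer.
assert np.allclose(lora_forward(x), W0 @ x)

full_params = d * k          # parameters updated by full fine-tuning of W
lora_params = r * (d + k)    # parameters in A and B only
print(f"trainable parameters: {lora_params} (LoRA) vs {full_params} (full)")
```

The parameter count makes the saving explicit: only r(d + k) values are trained instead of d × k, and the gap widens as the layer dimensions grow relative to the rank.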
Table 3: LoRA Configuration for scFM Fine-Tuning
| Component | Recommended Setting | Resource Impact | Performance Consideration |
|---|---|---|---|
| Rank (r) | 4-16 | Higher rank increases trainable parameters | Balance between adaptability and overfitting |
| Alpha | 2×rank | Scaling factor for adapted weights | Affects learning rate sensitivity |
| Target Modules | Attention layers (query, value) | Determines which components are adapted | Critical for maintaining pretrained knowledge |
| Dropout | 0.1 | Regularization during adaptation | Reduces overfitting to small datasets |
Practical implementation of LoRA can reduce trainable parameters by up to 98.4% compared to full fine-tuning, enabling adaptation of billion-parameter models on consumer-grade GPUs with minimal performance degradation [62]. For ISP tasks, this allows researchers to efficiently specialize models for predicting cellular responses to perturbations without prohibitive computational costs.
The GenomeNet-Architect framework demonstrates how multi-fidelity optimization can dramatically reduce computational overhead when tailoring architectures for genomic tasks. This approach uses cheaper approximations of model performance during initial search phases, allocating full resources only to the most promising candidates [63].
The key innovation is the progressive allocation of computational budget: initial configurations are evaluated with shorter training times and smaller data subsets, while only top-performing candidates receive full training cycles. This strategy can reduce overall search time by 67-83% while identifying architectures that outperform expert-designed models [63].
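One common way to realize this kind of progressive budget allocation is successive halving, sketched below. This is offered as an illustration of the general multi-fidelity idea, not as the search procedure used by GenomeNet-Architect; the `evaluate` callback, the configuration dictionaries, and the toy scoring function are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def successive_halving(
    configs: List[Dict],
    evaluate: Callable[[Dict, int], float],
    min_budget: int = 1,
    eta: int = 2,
) -> Tuple[Dict, int]:
    """Keep the top 1/eta of candidates each round while multiplying the budget by eta.

    Returns the surviving configuration and the total budget spent.
    """
    budget, spent = min_budget, 0
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [evaluate(cfg, budget) for cfg in survivors]
        spent += budget * len(survivors)
        # Rank candidates by score (higher is better) and keep the best fraction.
        ranked = sorted(range(len(survivors)), key=lambda i: scores[i], reverse=True)
        keep = max(1, len(survivors) // eta)
        survivors = [survivors[i] for i in ranked[:keep]]
        budget *= eta  # survivors earn a larger training budget next round
    return survivors[0], spent


# Toy stand-in for "train for `budget` epochs and return validation accuracy".
def toy_evaluate(cfg: Dict, budget: int) -> float:
    return cfg["quality"] * budget / (budget + 1)


qualities = [0.61, 0.74, 0.58, 0.80, 0.66, 0.70, 0.55, 0.77]
candidates = [{"name": f"cfg{i}", "quality": q} for i, q in enumerate(qualities)]
best, spent = successive_halving(candidates, toy_evaluate)
print(best["name"], "selected after spending", spent, "budget units")
```

In this toy setup low-fidelity scores preserve the true ranking, so the best candidate always survives. With real training curves, slow-starting configurations can be eliminated early, which is the main risk to weigh against the compute savings.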
For scFM implementation, this translates to:
In the context of ISP modeling, the closed-loop framework demonstrates how strategic incorporation of experimental data can optimize computational efficiency. This approach integrates perturbation data during model fine-tuning, significantly improving prediction accuracy while managing resource demands [9].
Remarkably, research shows that just 10-20 well-chosen perturbation examples can produce substantial improvements in model accuracy, with performance metrics approaching saturation at approximately 20 examples [9]. This suggests that computationally intensive fine-tuning on massive perturbation datasets may be unnecessary for effective ISP modeling.
Objective: Adapt pre-trained scFMs to predict cellular responses to genetic perturbations while minimizing computational resource requirements.
Materials and Computational Resources:
Procedure:
Data Preparation:
Progressive Fine-Tuning:
Validation and Iteration:
Expected Outcomes: This protocol should achieve a three-fold improvement in PPV (from 3% to 9%) for perturbation prediction while maintaining NPV above 98%, comparable to published closed-loop ISP results [9]. Total training time typically ranges from 2-6 hours on a single GPU, depending on dataset size and model architecture.
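For readers less familiar with these metrics, the following sketch shows how PPV and NPV are computed from confusion-matrix counts. The counts are invented purely so that the resulting PPV values match the 3% and 9% figures quoted above; they are not taken from the cited study.

```python
from typing import Tuple


def ppv_npv(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Positive and negative predictive value from confusion-matrix counts."""
    ppv = tp / (tp + fp)  # fraction of predicted hits that are true hits
    npv = tn / (tn + fn)  # fraction of predicted non-hits that are true non-hits
    return ppv, npv


# Hypothetical counts for 1,000 candidate perturbations, 100 of them predicted as hits.
ppv_before, npv_before = ppv_npv(tp=3, fp=97, tn=885, fn=15)
ppv_after, npv_after = ppv_npv(tp=9, fp=91, tn=891, fn=9)

print(f"before: PPV={ppv_before:.1%}, NPV={npv_before:.1%}")
print(f"after:  PPV={ppv_after:.1%}, NPV={npv_after:.1%}")
```

Because true hits are rare, NPV stays high almost regardless of model quality, which is why the change in PPV is the more informative number when judging a fine-tuning run.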
Objective: Systematically select appropriate scFMs based on task requirements and computational constraints.
Materials:
Procedure:
Efficiency Profiling:
Trade-off Analysis:
Decision Framework:
Table 4: Key Computational Tools for Managing scFM Resource Demands
| Tool/Resource | Function | Implementation Benefit |
|---|---|---|
| BioLLM Framework | Unified interface for diverse scFMs [5] | Standardized APIs eliminate architectural inconsistencies, reduce implementation overhead |
| LoRA (Low-Rank Adaptation) | Parameter-efficient fine-tuning [62] | Enables adaptation of billion-parameter models on consumer GPUs |
| GenomeNet-Architect | Neural architecture optimization [63] | Automates model selection, reduces design time by 67-83% |
| Closed-Loop ISP Framework | Iterative model refinement with experimental data [9] | Maximizes information gain from minimal perturbation examples (10-20 samples) |
| Multi-Fidelity Optimization | Progressive resource allocation [63] | Reduces computational waste during hyperparameter tuning |
| Flash-Attention Blocks | Memory-efficient attention computation [5] | Enables processing of longer gene sequences within memory constraints |
Managing the computational intensity of scFM training and fine-tuning requires a strategic approach that balances model performance with practical resource constraints. Through the implementation of parameter-efficient fine-tuning techniques like LoRA, strategic model selection informed by benchmarking studies, and resource-aware experimental design, researchers can effectively leverage scFMs for in silico perturbation modeling without prohibitive computational costs. The protocols and frameworks presented here provide a pathway for implementing these models in diverse research environments, from academic laboratories to industrial drug discovery programs. As the field evolves, continued development of computational efficiency methods will be essential for democratizing access to scFM technologies and realizing their full potential for biological discovery and therapeutic development.
The application of single-cell foundation models (scFMs) and other deep learning approaches in biology has created a profound interpretability gap. While these models demonstrate impressive predictive accuracy, their internal representations and decision-making processes often remain opaque black boxes [64] [65]. This opacity presents significant challenges for drug development and biological discovery, where understanding mechanism is as crucial as prediction. This document outlines the core challenges, provides protocols for evaluating latent embeddings, and presents visualization strategies to enhance interpretability within in silico perturbation modeling research.
In biological modeling, the black box problem manifests uniquely and with high stakes. When models like scGPT or scFoundation predict cellular responses to perturbations, we often cannot identify why they make specific predictions or what biological mechanisms they have learned [64] [66]. This limitation has direct consequences:
Recent benchmarking studies reveal that even simple baseline models (e.g., taking the mean of training examples) can outperform complex foundation models in predicting post-perturbation gene expression [66]. Furthermore, basic machine learning models incorporating biologically meaningful features like Gene Ontology vectors significantly outperform scFMs, suggesting that the latent embeddings in these foundation models may not be capturing the most biologically relevant information [66].
| Embedding Type | Predictor Model | Pearson Delta (Adamson) | Pearson Delta (Norman) | Biological Interpretability |
|---|---|---|---|---|
| GO Term Features | Random Forest | 0.739 | 0.586 | High (direct biological annotation) |
| scELMO Embeddings | Random Forest | 0.706 | 0.663 | Moderate (text-derived semantics) |
| scGPT Embeddings | Random Forest | 0.727 | 0.583 | Low (model-derived, opaque) |
| scGPT Embeddings | Fine-tuned scGPT | 0.641 | 0.554 | Very Low |
| scFoundation Embeddings | Fine-tuned scFoundation | 0.552 | 0.459 | Very Low |
| Train Mean Baseline | None | 0.711 | 0.557 | None |
Data adapted from critical benchmarking studies of post-perturbation RNA-seq prediction models [66]. Pearson Delta measures correlation in differential expression space, with higher values indicating better performance.
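A minimal sketch of the Pearson Delta computation is given below, assuming the metric is the Pearson correlation between predicted and observed expression changes relative to a control profile. The exact definition used in [66] may differ in details such as gene selection or pseudobulking; the function names and expression values here are illustrative only.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pearson_delta(pred: Sequence[float], true: Sequence[float], control: Sequence[float]) -> float:
    """Correlate predicted and observed changes relative to control."""
    pred_delta = [p - c for p, c in zip(pred, control)]
    true_delta = [t - c for t, c in zip(true, control)]
    return pearson(pred_delta, true_delta)


# Made-up mean expression for four genes.
control = [5.0, 3.0, 8.0, 1.0]
observed = [5.0, 6.0, 4.0, 1.0]   # two genes respond to the perturbation
predicted = [5.2, 4.5, 6.0, 1.1]  # right direction, underestimated magnitude

print(round(pearson_delta(predicted, observed, control), 3))
```

Since correlation ignores scale, a prediction with the right direction but half the true effect size still scores close to 1, which is worth keeping in mind when reading the table above.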
The benchmarking data reveals a crucial insight: using foundation model embeddings as features in simpler, interpretable models like Random Forests often yields better performance than the fine-tuned foundation models themselves [66]. This suggests that the information is present in the embeddings but may not be optimally utilized by the complex architectures.
Purpose: To validate whether latent features correspond to genuine biological mechanisms rather than dataset artifacts.
Materials:
Procedure:
Example Application: In the InterPLM study, feature f/939 consistently activated on proteins with a "Nudix box motif." When one strongly activating protein (B2GFH1) lacked this annotation in Swiss-Prot, researchers confirmed through InterPro and structural analysis that it indeed contained a Nudix box—revealing a missing database annotation rather than a model error [64].
Purpose: To ensure latent embeddings capture biologically consistent patterns across different data modalities.
Materials:
Procedure:
Example Application: The GEDI framework enables sample-specific transformations of a reference latent manifold, allowing researchers to disentangle technical variability from genuine biological signals and directly associate latent dimensions with sample characteristics like disease severity [67].
Effective visualization is crucial for interpreting complex biological latent spaces. The following workflow provides a systematic approach to creating interpretable visualizations of scFM embeddings:
Diagram: Color Visualization Workflow
When applying color to latent space visualizations, follow these evidence-based rules derived from colorization research [69]:
| Tool/Resource | Type | Primary Function | Interpretability Features |
|---|---|---|---|
| Sparse Autoencoders (SAEs) | Interpretation Method | Extract interpretable features from model internals | Identifies monosemantic features corresponding to biological concepts [64] |
| GEDI Framework | Analysis Framework | Multi-sample single-cell analysis | Enables cluster-free differential expression along cell state continuum [67] |
| NOBLE | Neural Operator | Captures experimental variability in neuron models | Biologically-informed latent embeddings for neural dynamics [68] |
| PertEval-scFM | Benchmarking Framework | Standardized evaluation of scFMs | Assesses zero-shot embedding quality for perturbation prediction [70] |
| scPerturb | Data Resource | Harmonized single-cell perturbation data | Provides ground truth for evaluating predictive models [71] |
| CellOracle | GRN-Based Prediction | Infers gene regulatory networks | Mechanistically interpretable perturbation predictions [71] |
The analysis of feature f/19746 in the Evo 2 DNA foundation model demonstrates how interpretable latent features can lead to genuine biological discovery [64]. This feature consistently activated across prophage regions in bacterial genomes, including previously unannotated regions. When researchers investigated, they found these regions contained phage-associated genes like integrases and invertases. Crucially, the feature activation pattern revealed the model had learned the functional relationship between CRISPR systems and phage immunity rather than superficial sequence similarity—when researchers scrambled CRISPR spacer sequences, activation persisted, but scrambling the direct repeats eliminated activation [64].
This case exemplifies the potential of interpretability methods to function as discovery tools that can identify missing biological annotations and reveal deeper functional relationships learned by the models.
Addressing interpretability challenges in biological latent embeddings requires both technical advances and cultural shifts in how we evaluate computational models. Promising directions include:
As the field progresses, the goal should not be merely to predict cellular behaviors but to understand them. The protocols and frameworks outlined here provide a pathway toward models that are not just predictive but truly explanatory, accelerating drug development and biological discovery through interpretable in silico perturbation modeling.
Mode collapse occurs when a machine learning model fails to capture the full diversity of the underlying data distribution, producing limited or repetitive predictions. Within the specialized field of in silico perturbation modeling with single-cell foundation models (scFMs), this manifests as an inability to accurately predict the unique cellular responses—specifically, changes in gene expression—elicited by diverse genetic or chemical perturbations [72] [6]. Instead, a collapsed model may default to predicting an average response, thereby obscuring the specific biological signals crucial for therapeutic discovery. Recent benchmarks have revealed a troubling anomaly: sophisticated perturbation-response models are frequently outperformed by a simplistic baseline that predicts the average of all perturbed cells in the training set, disregarding the individual perturbation label [72]. This indicates a systemic issue with how model performance is evaluated and underscores the critical need for robust solutions to ensure predictive diversity.
The primary quantitative signature of mode collapse is anomalously high performance on traditional metrics like unweighted Mean Squared Error (MSE) or control-referenced metrics such as Pearson(Δ), coupled with a failure to recapitulate the effects on specific, differentially expressed genes (DEGs) [72]. A definitive diagnostic check involves comparing your model's performance against a mean baseline—a model that always predicts the average expression profile across the entire training dataset. If a complex model fails to significantly outperform this naive baseline on DEG-aware metrics, it is likely suffering from mode collapse. The core issue is that standard metrics can be gamed by accurately predicting the large, uninteresting regions of the gene expression space that remain unchanged by a perturbation, while missing the critical, albeit smaller, niche signals [72].
Table 1: Key Diagnostic Metrics for Identifying Mode Collapse
| Metric Name | Traditional Use & Pitfall | Proposed Robust Alternative | Interpretation in Diagnostics |
|---|---|---|---|
| Mean Squared Error (MSE) | Measures average L2 error; rewards accuracy on non-changing genes, favoring mean prediction [72]. | Weighted MSE (WMSE) [72] | A collapsed model can look competitive on traditional MSE yet score poorly on WMSE. |
| Pearson(Δ) | Correlates control-referenced delta; inflated by systematic control bias [72]. | Weighted Delta R² (R²w(Δ)) [72] | Collapsed models show high Pearson(Δ) but low R²w(Δ), indicating failure to predict true effect sizes. |
| Mean Baseline Performance | A naive predictor that outputs the dataset mean; used as a negative control [72]. | Comparison against this baseline using WMSE/R²w(Δ). | A specialist model should significantly and consistently outperform the mean baseline. |
Objective: To determine whether a given scFM for perturbation prediction is experiencing mode collapse.
Inputs: Trained perturbation-response model, held-out test set of single-cell perturbation data.
Procedure:
This diagnostic workflow is visualized in the following diagram, which outlines the key decision points and analyses required to confirm the presence of mode collapse.
Figure 1: A diagnostic workflow for identifying mode collapse in perturbation models by comparing model performance against a mean baseline using both traditional and robust metrics.
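The comparison against the mean baseline can be sketched as follows. This is a toy illustration under several assumptions: the per-gene weights are hypothetical stand-ins for a DEG-based weighting scheme (the exact scheme in [72] is not reproduced here), the 20% improvement threshold `min_gain` is an arbitrary choice, and all expression values are invented.

```python
from typing import Dict, Sequence


def mse(true: Sequence[float], pred: Sequence[float]) -> float:
    """Unweighted mean squared error across genes."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


def wmse(true: Sequence[float], pred: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted MSE: (1/N) * sum_i w_i * (true_i - pred_i)^2."""
    return sum(w * (t - p) ** 2 for t, p, w in zip(true, pred, weights)) / len(true)


def collapse_check(true, model_pred, baseline_pred, weights, min_gain: float = 0.2) -> Dict[str, object]:
    """Flag likely collapse when the model barely beats the mean baseline on WMSE."""
    mse_gain = 1.0 - mse(true, model_pred) / mse(true, baseline_pred)
    wmse_gain = 1.0 - wmse(true, model_pred, weights) / wmse(true, baseline_pred, weights)
    return {"mse_gain": mse_gain, "wmse_gain": wmse_gain, "likely_collapsed": wmse_gain < min_gain}


# Ten genes; only the last two respond to the perturbation.
true = [1.0] * 8 + [4.0, 0.0]
weights = [0.1] * 8 + [5.0, 5.0]       # hypothetical DEG-aware weights
baseline = [1.2] * 8 + [2.0, 2.0]      # mean of all perturbed training cells
collapsed = [1.0] * 8 + [2.1, 1.9]     # accurate on unchanged genes, misses the DEGs
specialist = [1.1] * 8 + [3.8, 0.3]    # captures the perturbation-specific effect

print(collapse_check(true, collapsed, baseline, weights))
print(collapse_check(true, specialist, baseline, weights))
```

The collapsed model improves on the baseline mostly through genes that do not change, so its gain shrinks once the DEGs are up-weighted, while the specialist model keeps a large margin under both metrics.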
Principle: Directly counter mode collapse by modifying the training objective to prioritize accurate prediction of genes that are most likely to change in response to perturbations [72].
Solution: Replace the standard MSE loss with a Weighted Mean Squared Error (WMSE) loss function. WMSE assigns a higher weight to genes that are known to be differentially expressed across the spectrum of perturbations in the training data, forcing the model to focus its capacity on these informative features.
Procedure:
- Compute the loss as WMSE = (1/N) * Σ(weight_gene_i * (true_expression_gene_i - predicted_expression_gene_i)²).
Principle: Leverage limited, high-quality experimental perturbation data to guide and correct the model's predictions, pulling it out of a collapsed state [9].
Solution: Implement a closed-loop fine-tuning framework where the scFM is iteratively updated with data from targeted Perturb-seq experiments.
Procedure:
The following diagram illustrates the iterative and cumulative nature of this powerful approach.
Figure 2: The closed-loop fine-tuning protocol for overcoming mode collapse by iteratively incorporating experimental data.
Principle: Actively explore the state space to discover high-reward, unseen modes that the primary model currently misses. This is particularly relevant for generative models like GFlowNets used in biological sequence or perturbation design [73]. Solution: Employ a Loss-Guided GFlowNet (LGGFN) architecture, where an auxiliary agent's exploration is directed toward regions where the main model exhibits high training loss. Procedure:
Table 2: Key resources for developing robust in silico perturbation models.
| Category | Item / Software | Function in Perturbation Modeling |
|---|---|---|
| Benchmark Datasets | Norman et al. (2019) [72], Replogle et al. (2022) [72] | Standardized public datasets for training and benchmarking genetic perturbation models. |
| Computational Models | scGPT [72] [4], Geneformer [4] [9], GEARS [72] [4], CPA [4], LPM [4] | Foundational and specialized models for single-cell analysis and perturbation prediction. |
| Evaluation Metrics | Weighted MSE (WMSE) [72], Weighted Delta R² (R²w(Δ)) [72] | Robust, DEG-aware metrics to properly evaluate model performance and diagnose collapse. |
| Experimental Data | Perturb-seq [9] / CRISPR-screens | High-quality ground-truth data for closed-loop fine-tuning and model validation. |
In silico perturbation modeling with single-cell foundation models (scFMs) represents a paradigm shift in computational biology, enabling the prediction of cellular responses to genetic or chemical interventions. These models, pretrained on vast single-cell transcriptomics corpora, learn fundamental biological principles that can be adapted to specialized tasks through transfer learning. The optimization of scFMs hinges on three interconnected pillars: strategic data curation to ensure biological comprehensiveness and technical quality, thoughtful model architecture selection to capture complex gene-gene interactions, and effective transfer learning protocols to bridge general pretraining with specific applications. This framework provides the foundational methodology for realizing the potential of "virtual cell" models in accelerating therapeutic discovery and mechanistic biology.
The development of robust scFMs requires training on extensive, diverse, and high-quality single-cell datasets that capture a wide spectrum of biological conditions. Strategic data curation begins with leveraging large-scale repositories that provide standardized access to millions of single-cell profiles.
Table 1: Primary Data Sources for scFM Pretraining
| Data Source | Scale | Key Features | Applications |
|---|---|---|---|
| CZ CELLxGENE [1] | >100 million cells [1] | Unified access to annotated single-cell datasets [1] | General pretraining, cross-tissue analysis |
| Human Cell Atlas [1] | Multiorgan coverage [1] | Broad coverage of cell types and states [1] | Reference cell type embedding |
| PanglaoDB [1] | Curated compendium [1] | Data from multiple sources and studies [1] | Specialized model development |
| NCBI GEO/SRA [1] | Thousands of studies [1] | Extensive repository of sequencing data [1] | Supplemental training data |
The assembly of a high-quality, nonredundant dataset is as critical as model architecture for building robust scFMs [1]. This process requires careful selection of datasets, filtering of cells and genes, balancing dataset compositions, and implementing rigorous quality controls to address challenges such as varying sequencing depth, batch effects, technical noise, and inconsistent processing steps across studies [1].
Tokenization converts raw gene expression data into structured inputs that scFMs can process, making it a critical optimization step. Unlike natural language, which has an inherent word order, the genes in an expression profile have no natural ordering, so transformer architectures require an artificial ordering strategy.
Gene Ranking Methods: A predominant approach orders genes by expression magnitude within each cell, creating a deterministic sequence where the top-ranked genes form the input "sentence" [1]. Alternative strategies include binning genes by expression values or using normalized counts without complex ranking [1]. Comparative analyses suggest no clear advantage for overly complex ranking systems, with some models reporting robustness using simple normalized counts [1].
Token Enrichment: Beyond basic gene tokens, optimized inputs incorporate special tokens representing cell identity metadata, experimental conditions, or omics modalities [18]. Gene-level metadata such as Gene Ontology terms or chromosomal locations can provide additional biological context [1]. Batch information may be incorporated as special tokens to mitigate technical variations, though some models demonstrate robustness to batch effects without explicit batch tokens [1].
Table 2: Tokenization Strategies in scFMs
| Strategy | Implementation | Advantages | Limitations |
|---|---|---|---|
| Expression-based ranking | Orders genes by expression level within each cell [1] | Deterministic, captures highly expressed features | May overlook low-expression regulatory genes |
| Value binning | Partitions genes into expression bins [1] | Reduces sparsity, groups genes by expression range | Loss of precise expression values |
| Normalized counts | Uses normalized expression without reordering [1] | Simplicity, preserves original expression relationships | May not optimize attention mechanisms |
| Metadata enrichment | Incorporates gene/cell metadata as special tokens [1] [18] | Provides biological context, improves interpretability | Increases model complexity |
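The first two strategies in the table can be illustrated with a short sketch. It assumes one cell represented as a vector of normalized counts; the function names, the quantile-based bin edges, and the example gene list are illustrative choices and do not reproduce the tokenizer of any specific model.

```python
import numpy as np

def rank_tokenize(expression, gene_ids, max_len):
    """Expression-based ranking: expressed genes ordered by descending value."""
    expression = np.asarray(expression, dtype=float)
    expressed = np.flatnonzero(expression > 0)
    # Stable sort on negated values so ties keep the original gene order.
    order = expressed[np.argsort(-expression[expressed], kind="stable")]
    return [gene_ids[i] for i in order[:max_len]]

def bin_tokenize(expression, n_bins):
    """Value binning: nonzero values mapped to bins 1..n_bins, zeros stay 0."""
    expression = np.asarray(expression, dtype=float)
    bins = np.zeros(expression.shape, dtype=int)
    nonzero = expression > 0
    if nonzero.any():
        # Interior quantiles of the cell's own nonzero values act as bin edges.
        quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
        edges = np.quantile(expression[nonzero], quantiles)
        bins[nonzero] = np.digitize(expression[nonzero], edges) + 1
    return bins

# Toy cell with five genes (values are made up for illustration).
genes = ["GATA1", "RUNX1", "TP53", "ACTB", "MYC"]
cell = [0.0, 5.0, 2.0, 9.0, 2.0]

tokens = rank_tokenize(cell, genes, max_len=3)
binned = bin_tokenize(cell, n_bins=2)
```

Here `tokens` is the ranked gene "sentence" truncated to three genes, and `binned` keeps the original gene order but replaces each value with a coarse bin index, which shows the loss of precise expression values noted in the table.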
scFMs predominantly leverage transformer architectures, which utilize attention mechanisms to weight relationships between gene tokens, enabling the model to identify which genes are most informative for specific cellular identities or states [1]. The adaptation of these architectures to single-cell data requires specialized considerations to address the unique characteristics of transcriptomic information.
Encoder Architectures: Models like scBERT employ bidirectional encoder architectures based on BERT, processing all genes in a cell simultaneously to learn comprehensive contextual relationships [1] [5]. This approach excels in classification tasks such as cell type annotation and embedding generation, where full context understanding is beneficial [1].
Decoder Architectures: Models such as scGPT utilize decoder-inspired architectures with unidirectional masked self-attention, iteratively predicting masked genes conditioned on known genes [1] [18]. This design demonstrates strengths in generative tasks and perturbation prediction, where sequential generation aligns with the autoregressive approach [1].
Hybrid Designs: Emerging architectures explore encoder-decoder combinations and custom modifications to leverage benefits of both approaches [1]. While no single architecture has emerged as universally superior, each demonstrates particular strengths depending on the target application [1].
Rigorous benchmarking reveals distinct performance profiles across scFM architectures, enabling informed model selection based on specific application requirements. The BioLLM framework provides standardized evaluation of multiple models across diverse tasks [5].
Table 3: Architecture Performance Across Tasks (Based on BioLLM Benchmarking [5])
| Model | Architecture Type | Cell Embedding Quality | Batch Effect Correction | Perturbation Prediction | Computational Efficiency |
|---|---|---|---|---|---|
| scGPT | Decoder-based [1] | Superior (ASW: 0.75-0.92) [5] | Excellent [5] | Strong [18] | High [5] |
| Geneformer | Encoder-based [9] | Moderate (ASW: 0.65-0.85) [5] | Moderate [5] | Strong with fine-tuning [9] | High [5] |
| scBERT | Encoder-based [1] [5] | Lower (ASW: 0.45-0.70) [5] | Poor [5] | Limited [5] | Lower [5] |
| scFoundation | Not specified | Moderate [5] | Moderate [5] | Gene-level strength [5] | Moderate [5] |
scGPT consistently demonstrates superior performance in generating biologically relevant cell embeddings, achieving average silhouette width (ASW) scores of 0.75-0.92 across diverse datasets, indicating excellent separation of cell types in latent space [5]. This model also excels in batch effect correction, effectively integrating cells of the same type across experimental conditions [5]. Geneformer shows particular strength in gene-level tasks and perturbation response prediction when fine-tuned, benefiting from its effective pretraining strategy [9] [5]. In contrast, scBERT generally underperforms, likely due to smaller model size and limited training data [5].
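For readers unfamiliar with the ASW score, the following sketch computes the standard average silhouette width from cell embeddings and cell type labels using plain NumPy. It is an assumption that this matches the exact variant used in [5], since benchmarks sometimes rescale the score; the toy embeddings and labels are invented.

```python
import numpy as np

def average_silhouette_width(embeddings, labels):
    """Mean silhouette over cells, using Euclidean distance in embedding space."""
    X = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    unique = np.unique(labels)
    if len(unique) < 2:
        raise ValueError("silhouette needs at least two cell types")
    # Pairwise distances between all cells.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same < 2:
            continue  # singleton cluster: silhouette is defined as 0
        # Mean distance to the other cells of the same type (self distance is 0).
        a = dist[i, same].sum() / (n_same - 1)
        # Mean distance to the nearest other cell type.
        b = min(dist[i, labels == other].mean() for other in unique if other != labels[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return float(scores.mean())

# Two well-separated toy "cell types" in a 2-D embedding.
toy_embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels = ["T", "T", "T", "B", "B", "B"]
asw_separated = average_silhouette_width(toy_embeddings, toy_labels)

# Mixing the labels destroys the separation and lowers the score.
asw_mixed = average_silhouette_width(toy_embeddings, ["T", "B", "T", "B", "T", "B"])
```

Values near 1 indicate tight, well-separated cell types, values near 0 indicate overlapping types, and negative values indicate cells that sit closer to another type than to their own.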
Transfer learning bridges general scFM pretraining with specialized applications through two primary approaches: zero-shot inference using pretrained embeddings without additional training, and task-specific fine-tuning that updates model weights on targeted datasets [5].
Zero-Shot Inference Protocol:
Fine-Tuning Protocol:
The "closed-loop" framework represents an advanced transfer learning strategy that iteratively incorporates experimental perturbation data to enhance predictive accuracy [9] [74]. This approach significantly improves upon standard "open-loop" in silico perturbation (ISP) prediction by creating a feedback cycle between computational prediction and experimental validation [9].
Application Protocol - RUNX1-Familial Platelet Disorder:
This closed-loop approach demonstrated substantial improvement over open-loop ISP, increasing positive predictive value from 3% to 9% in T-cell activation studies while maintaining high negative predictive value (99%), sensitivity (76%), and specificity (81%) [9]. Performance gains saturated at approximately 20 perturbation examples, indicating that even modest experimental validation can substantially enhance prediction accuracy [9].
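The four reported quantities all derive from a confusion matrix of predicted versus experimentally confirmed hits. The sketch below uses hypothetical counts, chosen only so that the resulting rates land close to the reported closed-loop values; they are not the counts from [9].

```python
def screening_metrics(tp, fp, tn, fn):
    """Predictive values and rates from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else float("nan")
    return {
        "ppv": ratio(tp, tp + fp),          # fraction of predicted hits that are real
        "npv": ratio(tn, tn + fn),          # fraction of predicted non-hits that are real
        "sensitivity": ratio(tp, tp + fn),  # fraction of real hits that are recovered
        "specificity": ratio(tn, tn + fp),  # fraction of real non-hits that are rejected
    }

# Hypothetical screen: 25 true hits among 1,025 candidate genes.
closed_loop = screening_metrics(tp=19, fp=190, tn=810, fn=6)
```

With these assumed counts the sensitivity is 0.76 and the specificity 0.81, yet the PPV is only about 0.09, because true hits are rare among the candidates. This is why a low PPV can coexist with a very high NPV.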
Standardized benchmarking is essential for evaluating scFM performance across diverse biological tasks. The following protocol outlines a comprehensive evaluation framework based on established benchmarking methodologies [75] [5] [16].
Protocol 1: Comprehensive scFM Evaluation
Dataset Curation:
Evaluation Metrics:
Implementation:
Specialized protocols for perturbation prediction enable rigorous assessment of scFM capability to simulate cellular responses to genetic and chemical interventions.
Protocol 2: PertEval-scFM Framework [16]
Model Configuration:
Evaluation Methodology:
Analysis:
Successful implementation of scFMs for in silico perturbation modeling requires access to curated data resources, computational frameworks, and evaluation tools.
Table 4: Essential Research Resources for scFM Implementation
| Resource Category | Specific Tools | Function | Access |
|---|---|---|---|
| Data Repositories | CZ CELLxGENE [1], Human Cell Atlas [1], GEO/SRA [1] | Provide standardized single-cell datasets for pretraining and fine-tuning | Public access |
| Computational Frameworks | BioLLM [5], scGPT [18], Geneformer [9] | Unified interfaces for model training, fine-tuning, and evaluation | Open source |
| Benchmarking Platforms | PertEval-scFM [16], Custom evaluation pipelines [75] | Standardized assessment of model performance across tasks | Open source |
| Specialized Models | scBERT [1], scFoundation [75], scPlantFormer [18] | Task-optimized architectures for specific applications | Open source |
Data Quality Requirements: Effective scFM implementation necessitates careful data curation addressing sparsity, batch effects, and technical noise through rigorous quality control and normalization [1] [75]. The non-sequential nature of gene expression data requires strategic tokenization approaches, with expression-based ranking or binning providing effective input structuring [1].
Computational Resources: Model selection should balance performance requirements with available resources, as scGPT and Geneformer offer favorable efficiency profiles for large-scale analyses [5]. Transfer learning strategy should align with data availability, with zero-shot inference suitable for exploratory analysis and fine-tuning essential for optimal performance on specific tasks [9] [5].
Validation Strategies: Biological relevance should be assessed through ontology-informed metrics that evaluate consistency with prior knowledge [75]. Perturbation predictions require rigorous experimental validation using orthogonal modalities to establish ground truth for model refinement [9].
The advent of single-cell genomics has revolutionized our understanding of cellular heterogeneity, providing unprecedented resolution into the molecular states of individual cells. Concurrently, the rise of artificial intelligence has introduced single-cell foundation models (scFMs)—large-scale deep learning models pretrained on vast single-cell datasets—which promise to learn fundamental biological principles and generalize across diverse downstream tasks [12]. A particularly ambitious application of scFMs is in silico perturbation modeling, which aims to predict transcriptional responses to genetic perturbations without conducting costly wet-lab experiments [16]. This capability holds tremendous potential for accelerating therapeutic discovery and understanding disease mechanisms.
However, the rapid development of these models has created an urgent need for rigorous benchmarking frameworks to evaluate their predictive performance, limitations, and real-world applicability. This application note examines three key benchmarking initiatives—PertEval, PerturBench, and PEREGGRN—that provide standardized methodologies for assessing perturbation prediction capabilities in scFMs. We place special emphasis on PertEval, for which the most comprehensive benchmarking data is currently available, and discuss its implications for the field of computational biology.
PertEval-scFM is a standardized framework specifically designed to evaluate single-cell foundation models for perturbation effect prediction [17]. Its primary objective is to determine whether the contextualized representations (embeddings) learned by scFMs enhance the prediction of transcriptional changes following genetic perturbations compared to simpler baseline approaches. The benchmark operates primarily in a zero-shot setting, assessing the intrinsic capability of model embeddings without task-specific fine-tuning [17] [16].
The philosophical underpinning of PertEval is to test whether scFMs have truly learned fundamental biological principles that generalize to predicting perturbation outcomes. This approach contrasts with traditional benchmarking that might overemphasize performance on narrow tasks where models could be specifically optimized. The framework employs deliberately simple baselines to establish the minimum performance threshold that scFMs should exceed to demonstrate genuine value [11].
PertEval leverages publicly available perturbation datasets that have been widely used in previous model development and validation efforts. Key datasets include:
Data preprocessing follows standardized quality control procedures, including normalization and filtering to ensure comparability across models. For the double perturbation benchmark, the dataset is partitioned with 100 single perturbations and 62 double perturbations used for training/fine-tuning, while the remaining 62 double perturbations are held out for testing [11].
The evaluation workflow in PertEval involves several critical steps:
The entire evaluation process is repeated across multiple random partitions of the data to ensure statistical robustness, with results aggregated across five runs [11].
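The repeated partitioning of the double perturbations can be sketched as below. The perturbation names are placeholders, the seeds 0 to 4 are an arbitrary choice, and the even 62/62 split simply follows the numbers quoted above; the benchmark's own splitting code may differ.

```python
import random

def split_doubles(double_perturbations, seed):
    """Randomly split double perturbations into equal train and test halves."""
    rng = random.Random(seed)
    shuffled = sorted(double_perturbations)  # sort first so the split depends only on the seed
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Placeholder identifiers standing in for the 124 double perturbations.
doubles = [f"GENE{i}+GENE{i + 1}" for i in range(124)]

# Five random partitions, one per run.
partitions = [split_doubles(doubles, seed) for seed in range(5)]
train_doubles, test_doubles = partitions[0]
```

All single perturbations would be added to every training half, and the metrics from the five runs would then be aggregated.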
The PertEval benchmark has yielded several critical insights into the current capabilities and limitations of scFMs for perturbation prediction:
Across multiple evaluation scenarios, scFM embeddings did not consistently outperform simpler baseline models, especially under conditions of distribution shift [17]. The table below summarizes the comparative performance of various models against the established baselines:
Table 1: Performance comparison of scFMs against simple baselines on double perturbation prediction tasks
| Model | Prediction Error (L2 Distance) | Performance vs. Additive Baseline | Genetic Interaction Prediction Accuracy |
|---|---|---|---|
| Additive Baseline | Reference | - | Not Applicable |
| No Change Baseline | Higher than additive | Worse | Poor (limited to buffering interactions) |
| scGPT | Higher than additive | Worse | Mostly predicts buffering interactions |
| Geneformer | Higher than additive | Worse | Rarely predicts synergistic interactions |
| scFoundation | Higher than additive | Worse | Limited to specific gene subsets |
| GEARS | Higher than additive | Worse | Mostly buffering, rarely correct synergistic |
| UCE | Higher than additive | Worse | Similar to no-change baseline |
| scBERT | Higher than additive | Worse | Similar to no-change baseline |
The benchmarking revealed several consistent limitations across current-generation scFMs:
Although detailed benchmarking data for PerturBench are limited, the framework is designed to evaluate how well perturbation prediction models generalize across diverse cellular contexts and experimental conditions. It typically incorporates datasets from multiple cell types and perturbation modalities to assess cross-context transfer learning capabilities.
PEREGGRN specializes in benchmarking models for gene regulatory network inference from single-cell data. This framework addresses the distinct challenge of reconstructing directed regulatory relationships between genes, particularly transcription factors and their targets [76]. Benchmarking in this domain requires specialized ground truth networks derived from experimental data such as ChIP-seq, CRISPR perturbations, and carefully curated databases like RegulonDB [76].
The following table details key computational tools and data resources essential for conducting rigorous benchmarking of perturbation prediction models:
Table 2: Essential research reagents and computational tools for perturbation modeling benchmarking
| Resource Type | Specific Examples | Function/Application |
|---|---|---|
| Single-Cell Foundation Models | scGPT, scFoundation, Geneformer, UCE, scBERT [11] [12] | Generate contextualized embeddings of single-cell states for prediction tasks |
| Benchmarking Frameworks | PertEval-scFM, BEELINE [17] [77] | Standardized evaluation pipelines and performance metrics |
| Ground Truth Datasets | Norman et al. (CRISPRa), Replogle et al. (CRISPRi), Adamson et al. [11] | Experimentally validated perturbation data for training and testing |
| Baseline Models | Additive model, No-change model, Linear models [11] | Simple reference points for establishing minimum performance thresholds |
| Gene Regulatory Networks | STRING, RegulonDB, Cell-type-specific ChIP-seq [76] [77] | Curated molecular interaction data for validation of regulatory predictions |
| Specialized Architectures | 1DCNN-GRU hybrids, Graph Neural Networks, Transformers [78] [12] [77] | Advanced model architectures for capturing spatial and temporal dependencies |
Based on the methodologies employed across these benchmarking initiatives, we propose the following integrated protocol for rigorous evaluation of perturbation prediction models:
The current benchmarking landscape for in silico perturbation modeling reveals significant gaps between the promised capabilities of single-cell foundation models and their actual performance on predictive tasks. The consistent finding that simple baselines remain competitive with—and often outperform—sophisticated scFMs underscores the immaturity of this field and highlights the need for more biologically-grounded architectures and training approaches [17] [11].
Future development should focus on several key areas: (1) creating higher-quality datasets that capture a broader range of cellular states and perturbation strengths [17], (2) developing specialized model architectures that explicitly incorporate biological knowledge about gene regulatory networks [77], and (3) establishing more nuanced benchmarking frameworks that test specific biological capabilities beyond aggregate performance metrics. As these improvements materialize, rigorous benchmarking through initiatives like PertEval will remain essential for guiding progress toward truly predictive in silico models of cellular behavior.
The application of single-cell foundation models (scFMs) to predict gene expression changes following genetic perturbations represents a frontier in computational biology, with significant implications for drug development and basic research. These models, pre-trained on millions of single-cell transcriptomes, promise to serve as "virtual cells" for in silico experimentation, potentially reducing the need for costly and labor-intensive laboratory screens [9] [1]. However, a growing body of rigorous, comparative benchmarking studies reveals a striking consensus: sophisticated scFMs frequently fail to outperform deliberately simple baseline models, such as linear predictors and mean expression models, in predicting perturbation effects [11] [17]. This application note synthesizes critical findings from recent benchmarks, providing researchers with structured data and validated protocols to navigate this rapidly evolving field.
The performance gap between complex scFMs and simple baselines is consistent across diverse experimental contexts. A landmark study published in Nature Methods directly compared five foundation models and two other deep learning models against simple baselines for predicting transcriptome changes after single or double genetic perturbations. The study concluded that "none outperformed the baselines," highlighting a significant challenge for the field [11]. Similarly, the PEREGGRN benchmarking platform, which evaluates methods across 11 large-scale perturbation datasets, found that "it is uncommon for expression forecasting methods to outperform simple baselines" [79]. These findings underscore the importance of critical benchmarking in directing and evaluating methodological development, especially as scFMs are increasingly applied to prioritize therapeutic targets for conditions like RUNX1-familial platelet disorder and T-cell activation [9].
Comprehensive benchmarking across multiple datasets and experimental setups provides a clear, quantitative picture of the current capabilities and limitations of scFMs for perturbation prediction. The table below summarizes key performance metrics from major benchmarking studies, comparing scFMs against simple baseline models.
Table 1: Performance Summary of scFMs vs. Simple Baselines in Perturbation Prediction
| Benchmark Task | Top-Performing scFM | Best Simple Baseline | Performance Comparison | Key Metric | Dataset(s) |
|---|---|---|---|---|---|
| Double Perturbation Prediction | scGPT | Additive Model (Sum of LFCs) | scFM error substantially higher [11] | L2 Distance (Top 1k genes) | Norman et al. [11] |
| Unseen Single Perturbation Prediction | Geneformer (with linear decoder) | Linear Model with Pretrained P | No consistent improvement over baseline [11] | L2 Distance | Replogle et al. (K562, RPE1) [11] |
| Genetic Interaction Identification | Various scFMs | No Change Baseline | No model better than baseline [11] | True-Positive Rate vs. FDP | Norman et al. [11] |
| T-cell Activation Prediction (Open-loop) | Geneformer | Differential Expression (DE) | Superior NPV (98% vs 78%) and specificity (60% vs 50%), but same low PPV (3%) [9] | Predictive Values | Orthogonal Flow Cytometry [9] |
| T-cell Activation Prediction (Closed-loop) | Fine-tuned Geneformer | Open-loop ISP | 3x increase in PPV (3% to 9%) with improved sensitivity/specificity [9] | Positive Predictive Value | Perturb-seq in Primary T-cells [9] |
| Zero-shot Perturbation Effect Prediction | Multiple scFMs (Geneformer, scGPT) | Simple Baseline Models | scFM embeddings provided no consistent improvement, especially under distribution shift [17] | Multiple Metrics | PertEval-scFM Benchmark [17] |
A critical insight from these benchmarks is that even models explicitly designed for perturbation prediction, such as GEARS, scGPT, and scFoundation, struggle to surpass the predictive accuracy of simple models. The "additive model," which predicts double perturbation effects by summing the logarithmic fold changes of individual perturbations, consistently outperformed deep learning models. Similarly, a simple linear model or even predicting the mean expression across training perturbations often proved more effective and computationally efficient than fine-tuning large foundation models [11]. Furthermore, a specialized "closed-loop" fine-tuning approach, which incorporates experimental Perturb-seq data into the model training cycle, demonstrated that scFM performance can be significantly improved. This method achieved a three-fold increase in positive predictive value for T-cell activation, suggesting a viable path for enhancing scFM utility [9].
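Because the additive model is so simple, it helps to see it written out. The sketch below assumes log-transformed expression, so that a log fold change is a difference of means; the three-gene vectors and the "observed" double perturbation profile are invented numbers for illustration.

```python
import numpy as np

def mean_lfc(perturbed_log_expr, control_mean):
    """Mean log fold change of one single perturbation relative to control."""
    return np.asarray(perturbed_log_expr).mean(axis=0) - control_mean

def additive_prediction(control_mean, lfc_a, lfc_b):
    """Additive baseline: control expression plus both single-perturbation LFCs."""
    return control_mean + lfc_a + lfc_b

def l2_distance(predicted, observed):
    """Euclidean distance between predicted and observed expression profiles."""
    return float(np.linalg.norm(predicted - observed))

# Toy data: three genes, two cells per single perturbation (log scale).
control_mean = np.array([1.0, 2.0, 3.0])
cells_a = np.array([[1.5, 2.0, 3.0], [2.5, 2.0, 3.0]])
cells_b = np.array([[1.0, 2.0, 2.0], [1.0, 2.0, 1.0]])
observed_double = np.array([2.2, 2.0, 1.4])  # invented observation for A+B

lfc_a = mean_lfc(cells_a, control_mean)
lfc_b = mean_lfc(cells_b, control_mean)
additive = additive_prediction(control_mean, lfc_a, lfc_b)

additive_error = l2_distance(additive, observed_double)
no_change_error = l2_distance(control_mean, observed_double)
```

The additive baseline has no trainable parameters and cannot predict unseen single perturbations, which is why the benchmarks pair it with the no-change and linear baselines.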
To ensure reproducibility and facilitate independent validation of these findings, this section outlines detailed protocols for the key benchmarking experiments cited.
This protocol is adapted from the benchmark performed on the dataset from Norman et al., as detailed in [11].
I. Experimental Preparation and Reagents
II. Data Preprocessing
log1p).
III. Model Training and Baselines
control_expression + LFC_A + LFC_B, where LFC is the mean logarithmic fold change from the single perturbations in the training data.
IV. Performance Evaluation
This protocol is based on the methodology described in [9] for improving T-cell activation predictions.
I. Experimental Preparation
II. Initial Fine-tuning (Open-loop)
III. Closed-loop Fine-tuning
IV. In Silico Perturbation (ISP) and Evaluation
Diagram 1: The closed-loop fine-tuning workflow enhances scFM performance by integrating experimental perturbation data.
To clarify the logical relationships and structural differences between model types and benchmarking outcomes, the following diagrams are provided.
Diagram 2: Simplified architecture comparison of an scFM versus a simple linear baseline for perturbation prediction.
Diagram 3: Decision workflow for choosing a perturbation prediction strategy, based on benchmarking results.
Successful execution of the protocols and interpretation of benchmarking studies require familiarity with key computational tools and data resources. The following table catalogs essential components of the in silico perturbation modeling workflow.
Table 2: Key Research Reagents and Computational Tools for scFM Perturbation Research
| Item Name | Type | Primary Function in Research | Example/Source |
|---|---|---|---|
| Perturbation Datasets | Biological Data | Provides ground truth data for training and benchmarking models. | Norman et al. (CRISPRa), Replogle et al. (CRISPRi), Adamson et al. (UPR genes) [79] [11] |
| Single-cell Foundation Models (scFMs) | Pre-trained Model | Encodes prior biological knowledge from vast scRNA-seq atlases; base for fine-tuning. | Geneformer, scGPT, scFoundation, UCE [11] [1] |
| Benchmarking Platforms | Software Framework | Standardizes evaluation of different models across datasets and tasks. | PEREGGRN [79], PertEval-scFM [17] |
| Simple Baseline Models | Algorithm | Provides a critical performance baseline (e.g., additive, linear, mean predictor). | Additive Model, Linear Model (Y ≈ G x W x Pᵀ), Mean Predictor [11] |
| Gene Embeddings | Data Representation | Vector representations of genes learned by models; can be used in linear predictors. | Extracted from scFoundation or scGPT [11] |
| Perturbation Embeddings | Data Representation | Vector representations of perturbation effects; can be pre-trained on related data. | Extracted from GEARS or learned from data [11] |
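The linear baseline listed in the table, Y ≈ G x W x Pᵀ, can be fitted in closed form. The sketch below assumes Y is a genes × perturbations matrix of expression changes, G a genes × K matrix of gene embeddings, and P a perturbations × L matrix of perturbation embeddings. These shapes, the unregularized pseudoinverse solution, and the random data are assumptions for illustration; the cited benchmark may use a regularized fit and an offset term.

```python
import numpy as np

def fit_bilinear(Y, G, P):
    """Least-squares W minimizing ||Y - G W P^T||_F via pseudoinverses."""
    return np.linalg.pinv(G) @ Y @ np.linalg.pinv(P.T)

def predict_bilinear(G, W, P_new):
    """Predict expression changes for perturbations with embeddings P_new."""
    return G @ W @ P_new.T

# Simulated problem: 30 genes, 12 training perturbations, K = 4, L = 3.
rng = np.random.default_rng(0)
G = rng.normal(size=(30, 4))
P_train = rng.normal(size=(12, 3))
W_true = rng.normal(size=(4, 3))
Y_train = G @ W_true @ P_train.T  # noise-free data generated by the model itself

W_hat = fit_bilinear(Y_train, G, P_train)

# Unseen perturbations only need an embedding, not expression data.
P_unseen = rng.normal(size=(2, 3))
Y_unseen_pred = predict_bilinear(G, W_hat, P_unseen)
```

Because the only trainable object is the small matrix W, fitting takes a fraction of a second. This makes the baseline a cheap reference point before fine-tuning a foundation model.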
In the field of in silico perturbation modeling with single-cell foundation models (scFMs), the accurate assessment of model performance is paramount. Predicting transcriptional responses to genetic perturbations represents a core challenge in functional genomics, with significant implications for revealing gene functions, mapping regulatory networks, and accelerating therapeutic discovery [80]. As the space of possible perturbations is combinatorially complex, computational approaches have been developed to predict transcriptional outcomes of genetic perturbations that were never experimentally tested. The evaluation of these models relies heavily on specific performance metrics that quantify how well predictions match experimental observations.
Recent benchmarking studies have revealed surprising insights about metric performance and interpretation. Simple baseline models—including those that predict the average expression across all perturbed cells (perturbed mean) or the average of matched post-perturbation profiles for combinatorial perturbations (matching mean)—often perform comparably to or even outperform state-of-the-art foundation models like scGPT and scFoundation across multiple datasets [80] [81]. This phenomenon has been largely attributed to systematic variation in perturbation datasets, which represents consistent transcriptional differences between perturbed and control cells arising from selection biases or biological confounders [80]. These findings underscore the critical importance of selecting metrics that can distinguish true perturbation-specific effects from systematic biases.
This protocol explores the theoretical foundations, practical applications, and limitations of the primary metrics used for evaluating perturbation prediction models, with particular emphasis on their implementation within scFM research.
RMSE is defined as the square root of the average squared difference between predicted and observed values. For a sample of n observations y_i and corresponding model predictions ŷ_i, the RMSE is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
This metric represents the standard deviation of the prediction errors (residuals), providing a measure of how concentrated the data is around the line of best fit [82]. RMSE is expressed in the same units as the predicted variable, facilitating intuitive interpretation. The theoretical justification for RMSE stems from maximum likelihood estimation, where it is optimal for normally distributed (Gaussian) errors [83]. In perturbation modeling, it penalizes large errors more heavily than small errors due to the squaring of each term, making it particularly sensitive to outliers.
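The sensitivity of RMSE to large errors is easy to verify numerically. In the sketch below, two invented prediction vectors have the same mean absolute error (MAE), but one concentrates its error in a single gene.

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean squared error."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

def mae(observed, predicted):
    """Mean absolute error, shown for comparison."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(observed - predicted)))

observed = [1.0, 2.0, 3.0, 4.0]
spread_errors = [1.5, 2.5, 3.5, 4.5]   # every gene is off by 0.5
single_outlier = [1.0, 2.0, 3.0, 6.0]  # one gene is off by 2.0

rmse_spread, mae_spread = rmse(observed, spread_errors), mae(observed, spread_errors)
rmse_outlier, mae_outlier = rmse(observed, single_outlier), mae(observed, single_outlier)
```

Both predictions have an MAE of 0.5, but the RMSE rises from 0.5 to 1.0 when the same total error falls on one gene. A few strongly mispredicted genes can therefore dominate RMSE-based rankings.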
Spearman's rank correlation coefficient (Spearman's ρ) measures how well the relationship between two variables can be described using a monotonic function, whether linear or not [84]. It assesses how similar the ranks of observations are between two variables. The coefficient is calculated as:
$$\rho = \frac{\operatorname{cov}(\mathrm{rg}_X, \mathrm{rg}_Y)}{\sigma_{\mathrm{rg}_X}\,\sigma_{\mathrm{rg}_Y}}$$
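where $\mathrm{rg}_X$ and $\mathrm{rg}_Y$ are the rank variables of the predicted and ground truth values, cov denotes their covariance, and σ denotes the standard deviation of each rank variable [85]. For data without tied ranks, a simplified formula exists:
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)}$$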
where d_i is the difference between the two ranks of each observation and n is the sample size [84]. This nonparametric measure is appropriate for both continuous and discrete ordinal variables, making it valuable for assessing whether a model correctly captures the relative ordering of gene expression changes following perturbations, which is often more biologically meaningful than exact numerical predictions.
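Both forms of the coefficient can be checked against each other on a small example. The sketch assumes there are no tied values, as the simplified formula requires, and the five "observed" and "predicted" expression changes are invented.

```python
import numpy as np

def ranks_no_ties(values):
    """Ranks 1..n for a vector without tied values."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def spearman_from_covariance(x, y):
    """Covariance of the rank variables divided by their standard deviations."""
    rx, ry = ranks_no_ties(x), ranks_no_ties(y)
    return float(np.cov(rx, ry)[0, 1] / (np.std(rx, ddof=1) * np.std(ry, ddof=1)))

def spearman_simplified(x, y):
    """Simplified formula based on squared rank differences (no ties)."""
    d = ranks_no_ties(x) - ranks_no_ties(y)
    n = len(d)
    return float(1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1)))

observed_change = [0.1, -1.2, 2.3, 0.8, -0.4]
predicted_change = [0.3, -0.9, 1.1, 1.5, -0.2]

rho_cov = spearman_from_covariance(observed_change, predicted_change)
rho_simple = spearman_simplified(observed_change, predicted_change)
```

Only the two most upregulated genes swap ranks here, so both formulas give ρ = 0.9 even though the predicted magnitudes differ noticeably from the observed ones.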
The choice between RMSE and rank-based metrics should be informed by the error distribution characteristics and research objectives:
Table 1: Metric Selection Guide Based on Error Distribution and Research Goals
| Error Distribution | Optimal Metric | Theoretical Justification | Perturbation Modeling Context |
|---|---|---|---|
| Normal (Gaussian) | RMSE | Maximum likelihood estimator for normal errors [83] | Appropriate for technical replicates with well-controlled experimental conditions |
| Laplace (heavy-tailed) | MAE | Maximum likelihood estimator for Laplacian errors [83] | Better for data with occasional large prediction errors or outliers |
| Unknown or complex | Spearman's ρ | Nonparametric; assesses monotonic relationships without distributional assumptions [84] [85] | Preferred for evaluating ranking of differential expression effects |
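To make the RMSE versus MAE rows of the table concrete, here is a small sketch, using invented residuals, of how a single large error inflates RMSE far more than MAE. The function names are hypothetical and the numbers do not come from any cited benchmark.

```python
import math

def rmse_from_errors(errors):
    """Root mean squared error computed directly from residuals."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

def mae_from_errors(errors):
    """Mean absolute error computed directly from residuals."""
    return sum(abs(e) for e in errors) / len(errors)

# Ten small residuals, then the same set with one replaced by an outlier.
clean = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
with_outlier = clean[:-1] + [3.0]

# How much each metric grows once the outlier is introduced.
rmse_inflation = rmse_from_errors(with_outlier) / rmse_from_errors(clean)
mae_inflation = mae_from_errors(with_outlier) / mae_from_errors(clean)
```

With these values RMSE grows roughly 9.5-fold while MAE grows roughly 3.9-fold, which is why MAE is listed as the better fit for heavy-tailed error distributions.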
A critical challenge in perturbation modeling evaluation is the presence of systematic variation—consistent transcriptional differences between perturbed and control cells that arise from selection biases in the perturbation panel or underlying biological confounders [80]. This variation can profoundly impact metric interpretation:
Recent benchmarking studies have revealed substantial discrepancies in model rankings depending on the chosen evaluation metric:
Table 2: Comparative Performance of Models and Baselines Across Metrics and Datasets
| Dataset | Model | PearsonΔ | PearsonΔ20 | RMSE | Spearman's ρ |
|---|---|---|---|---|---|
| Adamson | Train Mean | 0.711 | - | - | - |
| Adamson | scGPT | 0.641 | - | - | - |
| Adamson | RF with GO features | 0.739 | - | - | - |
| Norman | Train Mean | 0.557 | - | - | - |
| Norman | scGPT | 0.554 | - | - | - |
| Norman | RF with GO features | 0.586 | - | - | - |
| Replogle K562 | Train Mean | 0.373 | - | - | - |
| Replogle K562 | scGPT | 0.327 | - | - | - |
| Replogle K562 | RF with GO features | 0.480 | - | - | - |
| Generic evaluation | CPA | Variable | Variable | Variable | - |
| Generic evaluation | GEARS | Variable | Variable | Variable | - |
Data adapted from benchmark studies [80] [81]. PearsonΔ represents correlation in differential expression space.
Unexpectedly, the simple Train Mean baseline consistently matches or exceeds the performance of sophisticated foundation models like scGPT and scFoundation across multiple datasets when evaluated using Pearson correlation in differential expression space (PearsonΔ) [81]. Similarly, in predicting combinatorial perturbation responses, the matching mean baseline outperformed all other methods by considerable margins (11% improvement for PearsonΔ over the best alternative method) [80].
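The following sketch illustrates, under simplifying assumptions, how a Train Mean baseline and PearsonΔ could be computed. It assumes pseudo-bulk profiles are plain lists of log-expression values over the same genes and that PearsonΔ is the Pearson correlation between predicted and observed differences from the control profile. All names and values are hypothetical and the benchmark implementations in [80] [81] may differ in detail.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_profile(profiles):
    """Gene-wise mean across a list of equal-length expression profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def pearson_delta(predicted, observed, control):
    """Correlation in differential expression space (profile minus control)."""
    pred_delta = [p - c for p, c in zip(predicted, control)]
    obs_delta = [o - c for o, c in zip(observed, control)]
    return pearson(pred_delta, obs_delta)

# Invented pseudo-bulk profiles over four genes (illustration only).
control = [1.0, 2.0, 3.0, 4.0]
train_pseudobulks = {
    "pertA": [1.5, 2.0, 2.0, 4.5],
    "pertB": [1.3, 2.2, 2.4, 4.1],
}
held_out_observed = [1.4, 1.9, 2.3, 4.4]

# Train Mean: the same prediction is issued for every held-out perturbation.
train_mean_prediction = mean_profile(list(train_pseudobulks.values()))
score = pearson_delta(train_mean_prediction, held_out_observed, control)
```

Because the invented training perturbations shift expression in the same direction as the held-out one, the perturbation-agnostic Train Mean prediction already scores a high PearsonΔ. This mirrors how shared, systematic shifts between perturbed and control cells can make a trivial baseline look strong.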
To address metric limitations, the Systema framework has been developed specifically for evaluating genetic perturbation response prediction beyond systematic variation [80]. This framework introduces two key advances:
The framework implementation is available on GitHub (https://github.com/mlbio-epfl/systema) and provides more biologically meaningful assessment of perturbation response modeling [80].
[Figure: scFM Perturbation Evaluation Workflow. Workflow diagram of the comprehensive evaluation process for perturbation prediction models.]
Dataset Selection and Partitioning
Baseline Model Implementation
Model Inference
Pseudo-bulk Creation
Reference-based Differential Expression
Gene Selection for Evaluation
RMSE Calculation
Spearman's Rank Correlation Calculation
Pearson Correlation in Differential Expression Space
Pathway Enrichment Analysis
Cell State Distribution Analysis
Installation and Setup
git clone https://github.com/mlbio-epfl/systema
Perturbation-Specific Effect Isolation
Perturbation Landscape Reconstruction Assessment
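As a rough sketch of the pseudo-bulk creation and gene selection steps named in the workflow above, the snippet below averages single-cell profiles into a pseudo-bulk and picks the genes with the largest absolute change relative to control. The assumption that PearsonΔ20 restricts evaluation to the top 20 such genes is ours, since the text does not define it; the function names and toy data are likewise hypothetical and this is not the Systema implementation.

```python
def pseudobulk(cells):
    """Average a list of per-cell expression vectors into one profile."""
    n = len(cells)
    return [sum(column) / n for column in zip(*cells)]

def top_k_de_genes(perturbed, control, k):
    """Indices of the k genes with the largest absolute change vs. control."""
    deltas = [abs(p - c) for p, c in zip(perturbed, control)]
    ranked = sorted(range(len(deltas)), key=lambda g: deltas[g], reverse=True)
    return ranked[:k]

# Invented single-cell log-expression values over four genes.
control_cells = [[1.0, 2.0, 3.0, 4.0], [1.2, 1.8, 3.0, 4.2]]
perturbed_cells = [[1.0, 2.0, 1.0, 5.0], [1.2, 2.0, 1.2, 5.2]]

control_bulk = pseudobulk(control_cells)
perturbed_bulk = pseudobulk(perturbed_cells)

# With real data k would be 20 under the assumption stated above.
selected = top_k_de_genes(perturbed_bulk, control_bulk, k=2)
```

Restricting metrics to the selected genes focuses evaluation on the strongest responders, at the cost of ignoring genes whose expression a good model should predict as unchanged.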
Table 3: Key Research Reagents and Computational Tools for Perturbation Metric Evaluation
| Resource Name | Type | Function in Evaluation | Implementation Notes |
|---|---|---|---|
| Systema [80] | Evaluation Framework | Isolates perturbation-specific effects from systematic variation | GitHub: mlbio-epfl/systema; Requires AnnData format |
| Perturbation Datasets [80] [81] | Experimental Data | Benchmark model performance | Adamson, Norman, Replogle datasets; PEX splitting recommended |
| Train Mean Baseline [80] [81] | Baseline Model | Simple benchmark for average perturbation effects | Average of training pseudo-bulk profiles |
| Matching Mean Baseline [80] | Baseline Model | Benchmark for combinatorial perturbations | Average of matching single-gene perturbation centroids |
| AUCell [80] | Analysis Tool | Scores pathway activity in single cells | Identifies systematically enriched pathways |
| GSEA [80] | Analysis Method | Gene set enrichment analysis | Detects pathway-level systematic variation |
| scGPT [80] [81] | Foundation Model | Benchmark complex model architecture | Fine-tune on perturbation data |
| GEARS [80] | Prediction Method | Benchmark model using biological networks | Incorporates prior knowledge |
| Random Forest with GO features [81] | Baseline Model | Biologically-informed baseline | Uses Gene Ontology vectors as features |
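The Matching Mean baseline in the table is described as the average of matching single-gene perturbation centroids. A minimal sketch of that idea follows, assuming combinatorial perturbations are named with a `+` separator and that a centroid is available for every constituent gene; both conventions are assumptions made for illustration, and the gene names and values are invented.

```python
def matching_mean(combination, single_centroids):
    """Predict a combinatorial profile as the gene-wise mean of the
    centroids of its constituent single-gene perturbations."""
    genes = combination.split("+")
    profiles = [single_centroids[gene] for gene in genes]
    return [sum(column) / len(profiles) for column in zip(*profiles)]

# Invented centroids (mean expression profiles) for two single perturbations.
single_centroids = {
    "GENE_A": [1.0, 3.0, 0.5],
    "GENE_B": [3.0, 5.0, 1.5],
}

prediction = matching_mean("GENE_A+GENE_B", single_centroids)
```

This baseline uses no model at all, so any method evaluated on combinatorial perturbations should at least be compared against it.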
[Figure: Metric Selection Based on Error Distribution. Diagram relating error distributions to the appropriate evaluation metric.]
When reporting perturbation prediction results, include these essential elements:
Baseline Comparison: Always report performance of simple baselines (Train Mean, Matching Mean) alongside model results [80] [81]
Multiple Metric Perspective: Report both RMSE and rank correlation metrics to provide complementary views of performance
Systematic Variation Assessment: Document the extent of systematic variation in datasets and its potential impact on metrics [80]
Statistical Significance: Include confidence intervals or p-values for correlation metrics to distinguish meaningful differences from random variation
Biological Validation: Where possible, supplement quantitative metrics with biological validation of predicted perturbation effects
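For the statistical significance item above, one common option is a percentile bootstrap confidence interval for a correlation metric. The sketch below resamples genes with replacement; the resampling unit, the number of bootstrap draws, and the example values are all assumptions chosen for illustration rather than a prescription from the cited benchmarks.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation; returns None if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sxx * syy)

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the Pearson correlation."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([x[i] for i in idx], [y[i] for i in idx])
        if r is not None:  # skip degenerate resamples with no variance
            stats.append(r)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Invented predicted and observed expression changes for eight genes.
pred = [0.4, 0.1, -0.8, 0.3, 1.2, -0.5, 0.0, 0.7]
obs = [0.5, -0.1, -0.6, 0.4, 0.9, -0.7, 0.2, 0.5]

point_estimate = pearson(pred, obs)
ci_low, ci_high = bootstrap_ci(pred, obs)
```

Reporting the interval alongside the point estimate makes it easier to judge whether a gap between a model and a baseline exceeds sampling noise. Overlapping intervals for two methods suggest the difference may not be meaningful.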
The evaluation of perturbation prediction models requires careful metric selection informed by both statistical principles and biological considerations. The recent discovery that simple baselines can outperform sophisticated foundation models underscores the limitations of current evaluation approaches and the pervasive influence of systematic variation in perturbation datasets [80] [81].
The introduction of bias-aware evaluation frameworks like Systema represents significant progress toward more biologically meaningful assessment [80]. Future developments should focus on creating metrics that better capture a model's ability to predict functionally relevant perturbation effects rather than merely recapitulating systematic biases. As the field advances, standardized evaluation protocols incorporating these insights will be essential for meaningful comparison of perturbation prediction methods and translation of computational predictions to biological discovery and therapeutic development.
A principal ambition in the development of single-cell Foundation Models (scFMs) is their capacity for zero-shot prediction—the ability to accurately forecast the effects of genetic or chemical perturbations without task-specific fine-tuning. This capability is considered a critical benchmark for true biological understanding within these models. The rationale is that a model which has internalized the fundamental rules of cell biology from its pretraining data should be able to generalize its knowledge to novel experimental conditions, including unseen perturbations. Such a capability would revolutionize drug discovery and functional genomics by enabling in silico screening of perturbation outcomes, drastically reducing experimental costs and time. However, recent rigorous benchmarking studies have revealed a significant gap between this ambition and current model capabilities, showing that scFMs often fail to outperform deliberately simple baselines on perturbation prediction tasks [11].
The core challenge lies in the models' ability to move beyond pattern recognition in their training data to genuine mechanistic reasoning about novel perturbations. This application note synthesizes current evidence on the zero-shot generalization capacities of leading scFMs, providing structured experimental protocols and benchmarks to guide their evaluation in perturbation modeling tasks. By establishing standardized assessment frameworks, we aim to facilitate more meaningful comparisons across models and accelerate progress toward truly generalizable perturbation prediction systems.
The field of single-cell Foundation Models has rapidly diversified, with multiple architectures employing different pretraining strategies and learning objectives. Understanding these foundational differences is crucial for interpreting their varied performance on zero-shot perturbation tasks. These models are predominantly built on transformer architectures and learn from vast single-cell RNA sequencing corpora, but they diverge significantly in their approach to representing biological information [1] [12].
Table 1: Key Single-Cell Foundation Models and Their Architectures
| Model | Architecture Type | Parameters | Pretraining Data Scale | Key Innovation |
|---|---|---|---|---|
| scGPT | Decoder-only Transformer | 50 million | 33 million cells | Generative pretraining with gene expression prediction [4] |
| Geneformer | Encoder-only Transformer | 40 million | 30 million cells | Rank-based gene tokenization; mechanistic network learning [1] |
| scFoundation | Asymmetric encoder-decoder | 100 million | 50 million cells | Read-depth-aware masked gene modeling [10] |
| scBERT | Bidirectional Encoder | Not specified | Millions of cells | Early transformer adaptation for single-cell data [1] |
| UCE | Encoder with protein embeddings | 650 million | 36 million cells | Incorporates protein sequence information via ESM-2 embeddings [10] |
| LPM (Large Perturbation Model) | Decoder-only with disentangled conditioning | Not specified | Heterogeneous perturbation data | Explicit disentanglement of Perturbation, Readout, and Context (PRC) [4] |
A critical differentiator among these models is their tokenization strategy—how they convert gene expression data into sequences that transformers can process. Most models represent individual genes as tokens, but they employ different methods for handling expression values and gene ordering. Some models like Geneformer rank genes by expression level to create input sequences, while others like scGPT use value binning or projection techniques [1] [12] [10]. The recently proposed Large Perturbation Model (LPM) introduces a novel approach by explicitly disentangling the representation of perturbations, readouts, and experimental contexts, allowing it to integrate more heterogeneous perturbation data across different modalities [4].
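The two tokenization families mentioned above can be sketched as follows. This is a deliberately simplified illustration: the rank-based function orders genes by raw expression, and the binning function uses equal-width bins as a stand-in for whichever binning scheme a given model applies. Neither function reproduces the actual Geneformer or scGPT preprocessing, and the gene values are invented.

```python
import math

def rank_tokenize(expression, max_len=2048):
    """Rank-based tokens: expressed genes ordered from highest to lowest
    expression, keeping gene identities only (no expression values)."""
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))  # ties broken by gene name
    return [gene for gene, _ in expressed[:max_len]]

def bin_tokenize(expression, n_bins=3):
    """Value-binned tokens: each gene is paired with a bin index in 1..n_bins,
    with 0 reserved for unexpressed genes (equal-width bins, an assumption)."""
    max_value = max(expression.values(), default=0.0)
    tokens = {}
    for gene, value in expression.items():
        if value <= 0 or max_value <= 0:
            tokens[gene] = 0
        else:
            tokens[gene] = min(n_bins, math.ceil(value * n_bins / max_value))
    return tokens

# Invented normalized expression values for one cell.
cell = {"GATA1": 9.0, "TP53": 3.0, "MYC": 6.0, "ACTB": 0.0}

rank_tokens = rank_tokenize(cell)
bin_tokens = bin_tokenize(cell)
```

The rank-based sequence preserves only relative ordering, whereas the binned representation retains a coarse magnitude for every gene. This is the trade-off the paragraph above describes.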
Recent comprehensive benchmarking studies have yielded sobering results regarding the zero-shot perturbation prediction capabilities of current scFMs. A landmark study published in Nature Methods in 2025 compared five foundation models and two other deep learning models against deliberately simple baselines for predicting transcriptome changes after single or double genetic perturbations [11]. The benchmarks assessed models on their ability to predict expression changes following double perturbations in K562 cells using data from Norman et al., with models trained on single perturbations and a subset of double perturbations then evaluated on held-out double perturbations.
The results revealed that no deep learning model outperformed a simple additive baseline that predicts the sum of individual logarithmic fold changes without using any double perturbation data during training. All scFMs had substantially higher prediction error (L2 distance between predicted and observed expression values) than this simplistic approach. Similarly, for predicting genetic interactions—where the phenotype of simultaneous perturbations differs unexpectedly from additive effects—none of the models performed better than a "no change" baseline that always predicts control condition expression [11].
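The additive and no-change baselines described above are simple enough to write down directly. The sketch below assumes expression profiles are lists of log-scale values over the same genes, so that summing individual log fold changes amounts to adding each single-perturbation shift to the control profile. The values and function names are invented for illustration and are not taken from the benchmark in [11].

```python
import math

def additive_prediction(control, single_a, single_b):
    """Control profile plus the sum of the two single-perturbation log fold changes."""
    return [c + (a - c) + (b - c) for c, a, b in zip(control, single_a, single_b)]

def l2_distance(u, v):
    """Euclidean (L2) distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Invented log-expression profiles over three genes.
control = [2.0, 2.0, 2.0]
single_a = [3.0, 2.0, 1.0]
single_b = [2.0, 3.0, 1.5]
observed_double = [3.0, 3.0, 0.0]

additive = additive_prediction(control, single_a, single_b)
no_change = list(control)  # the no-change baseline predicts control expression

additive_error = l2_distance(additive, observed_double)
no_change_error = l2_distance(no_change, observed_double)

# Deviation of the observed double perturbation from the additive expectation,
# i.e. the genetic interaction signal a model would need to predict.
interaction = [o - p for o, p in zip(observed_double, additive)]
```

In this toy case the additive prediction misses only the third gene, where the observed effect is stronger than the sum of the single effects. Such residuals are what genetic interaction prediction tries to capture.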
Table 2: Benchmarking Results on Double Perturbation Prediction Tasks
| Model Category | Representative Models | Performance on Double Perturbation | Performance on Genetic Interaction Prediction | Key Limitations |
|---|---|---|---|---|
| Simple Baselines | Additive model, No-change model | Reference standard | No-change model competitive with complex models | Biologically simplistic but surprisingly effective |
| Specialized Perturbation Models | GEARS, CPA | Worse than additive baseline | Not better than no-change baseline | Limited generalization beyond training data |
| General scFMs | scGPT, Geneformer, scFoundation, scBERT | Worse than additive baseline | Not better than no-change baseline; rarely predict synergistic interactions correctly | Struggle to capture nonlinear interaction effects |
| New Architectures | LPM | State-of-the-art on some tasks but limited zero-shot tests | Shows promise but requires rigorous benchmarking | Limited evaluation on true zero-shot scenarios |
These findings suggest that the goal of creating foundation models that provide generalizable representations of cellular states capable of predicting the outcome of not-yet-performed experiments "remains elusive" with current approaches [11]. The models particularly struggled to predict synergistic interactions, with most predominantly predicting buffering interactions and rarely making correct predictions of true synergistic effects.
The ability to predict effects of completely unseen perturbations represents an even greater challenge and more rigorous test of zero-shot capabilities. In benchmarks assessing prediction of single gene perturbation effects across different cell lines (K562 and RPE1), no deep learning model consistently outperformed even simpler baselines, including a linear model with carefully constructed embeddings or simply predicting the mean expression across training perturbations [11].
Notably, when researchers extracted gene embeddings from scFoundation and scGPT and used them in a simple linear model, this approach performed as well as or better than the native implementations of scGPT and GEARS with their built-in decoders. However, these embedding-enhanced linear models still did not consistently outperform linear models using embeddings derived directly from the perturbation data itself [11].
The most effective approach identified was a linear model with perturbation representations pretrained on relevant perturbation data, suggesting that pretraining on perturbation data specifically may be more valuable than pretraining on general single-cell atlases alone. This finding questions whether the current paradigm of pretraining on broad single-cell corpora is optimal for perturbation prediction tasks.
Objective: Evaluate model capability to predict transcriptomic effects of unseen double gene perturbations after training on single perturbations and a subset of double perturbations.
Datasets:
Experimental Design:
Evaluation Metrics:
Baseline Comparisons:
Implementation Considerations:
Objective: Assess model capability to generalize perturbation effects across cellular contexts without fine-tuning.
Datasets:
Experimental Design:
Evaluation Metrics:
Baseline Comparisons:
Objective: Evaluate model capability to predict mechanisms of action for novel chemical compounds based on structural or functional similarity to training compounds.
Datasets:
Experimental Design:
Evaluation Metrics:
[Figure: Zero-Shot Evaluation Workflow for scFMs]
[Figure: Architectural Comparison for Perturbation Prediction]
Table 3: Key Computational Tools and Frameworks for scFM Perturbation Research
| Tool/Resource | Type | Primary Function | Application in Perturbation Studies |
|---|---|---|---|
| BioLLM | Standardized framework | Unified interface for diverse scFMs | Enables consistent benchmarking across models with standardized APIs [86] [5] |
| GEARS | Specialized perturbation model | Predicts effects of single and double gene perturbations | Baseline for genetic perturbation prediction tasks [11] |
| CPA | Compositional perturbation autoencoder | Predicts effects of perturbation combinations and dosages | Handles drug combinations and dose-response relationships [11] |
| LPM | Large perturbation model | Integrates heterogeneous perturbation experiments | Cross-modal perturbation prediction (genetic + chemical) [4] |
| CELLxGENE | Data repository | Curated single-cell datasets | Source of standardized training and benchmarking data [1] [12] |
| Norman et al. Dataset | Benchmark dataset | CRISPR activation perturbation data | Gold standard for double perturbation benchmarking [11] |
| Replogle et al. Dataset | Benchmark dataset | CRISPRi perturbation data across cell lines | Evaluation of cross-cell-line generalization [11] |
When interpreting zero-shot perturbation prediction results, several critical considerations emerge from recent benchmarking studies:
Performance relative to simple baselines: The fact that simple additive models or mean predictors remain competitive with sophisticated scFMs suggests that current models may not be capturing higher-order biological interactions as effectively as hoped [11]. This performance gap should be openly acknowledged when presenting results.
Task-specific strengths: No single model consistently outperforms others across all perturbation tasks. For example, while scGPT demonstrates robust performance across multiple tasks in some benchmarks, other models like Geneformer and scFoundation show specialized strengths in gene-level tasks [86] [5]. Model selection should be guided by specific application requirements.
Data leakage concerns: Given that many scFMs are pretrained on massive single-cell corpora that may include perturbation data, rigorous protocols are needed to ensure that "unseen" perturbations in benchmarks are truly novel and not present in any form during pretraining [10].
Biological plausibility vs. quantitative accuracy: While quantitative metrics like L2 distance are important, biological plausibility of predictions should also be assessed through gene set enrichment analysis, pathway activation scores, and expert biological validation.
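The data leakage consideration above can be partially checked with a simple overlap test between the perturbations held out for evaluation and those known to appear in pretraining or training data. The sketch below is a hypothetical helper, assuming perturbations are identified by human gene symbols that can be compared after trimming whitespace and upper-casing. It cannot detect leakage through datasets whose perturbation metadata is undocumented.

```python
def normalize_symbol(symbol):
    """Normalize a gene symbol for comparison (assumes human-style symbols)."""
    return symbol.strip().upper()

def find_leaked_perturbations(held_out, seen_in_training):
    """Return held-out perturbations that also occur in the training or
    pretraining perturbation list, after normalization."""
    seen = {normalize_symbol(s) for s in seen_in_training}
    candidates = {normalize_symbol(s) for s in held_out}
    return sorted(candidates & seen)

# Invented perturbation lists for illustration.
held_out = ["KLF1", "gata1 ", "CEBPA"]
seen_in_training = ["GATA1", "MYC"]

leaked = find_leaked_perturbations(held_out, seen_in_training)
```

Any perturbation returned by such a check should be removed from the "unseen" evaluation set or reported separately.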
Current evidence suggests that the field must temper expectations about the zero-shot capabilities of existing scFMs while continuing to develop more sophisticated benchmarks and model architectures. The promising performance of newer approaches like LPM that explicitly model the disentanglement of perturbations, readouts, and contexts points toward potentially fruitful architectural directions [4].
The assessment of zero-shot prediction capabilities for unseen perturbations reveals both significant challenges and promising pathways forward. Current scFMs show limited ability to generalize beyond their training data to novel perturbations, particularly for complex genetic interactions and cross-context transfer. However, standardized benchmarking frameworks like BioLLM are enabling more rigorous comparisons, while novel architectures like LPM suggest potential strategies for improvement [86] [4] [5].
Critical future directions include: (1) developing more sophisticated benchmarking protocols that better reflect real-world biological discovery scenarios, (2) creating models that explicitly represent biological mechanisms rather than relying solely on statistical patterns in training data, (3) improving the integration of diverse data types including protein structures, pathway information, and chemical properties, and (4) establishing clearer evaluation metrics that balance quantitative accuracy with biological plausibility.
As the field progresses, the community would benefit from increased focus on model interpretability, better documentation of pretraining data composition, and more rigorous separation between training and evaluation data to enable true assessment of generalization capabilities. Through these efforts, the vision of accurate in silico prediction of perturbation effects may gradually transition from aspirational to achievable.
In the field of single-cell genomics, single-cell Foundation Models (scFMs) have emerged as transformative tools for interpreting the complex language of gene expression. Models like Geneformer, scGPT, scBERT, and UCE are pretrained on millions of single-cell transcriptomes, promising to capture universal biological principles and accelerate discovery in areas like drug development and disease modeling [10]. A core application driving this promise is in silico perturbation modeling—the ability to computationally predict how cells respond to genetic or chemical interventions. However, as these models are increasingly considered for high-stakes research, a critical and comparative evaluation of their capabilities, limitations, and optimal use cases is essential. This application note synthesizes recent benchmarking studies and practical protocols to provide a structured framework for researchers, scientists, and drug development professionals to effectively leverage these leading scFMs within their perturbation modeling workflows.
The comparative strength of an scFM is fundamentally shaped by its architectural choices and the data on which it was trained. The table below summarizes the core design principles of the four leading models.
Table 1: Architectural and Pretraining Overview of scFMs
| Model | Architecture | Pretraining Data Scale | Input Gene Representation | Primary Pretraining Task |
|---|---|---|---|---|
| Geneformer [10] [87] | Transformer Encoder | 30 million cells | 2,048 ranked genes (no expression values) | Masked Gene Modeling (MGM) with Gene ID prediction |
| scGPT [10] [88] | Transformer Encoder | 33 million cells | ~1,200 HVGs with binned expression values | Iterative MGM with MSE loss; generative pretraining |
| scBERT [10] [89] | Transformer Encoder | Not specified in context | Not specified in context | MGM with gene ID prediction |
| UCE [11] [10] | Transformer Encoder | 36 million cells | 1,024 genes sampled by expression & genomic position | Binary prediction of whether a gene is expressed |
Key distinctions include how they handle gene expression values: scGPT incorporates expression magnitudes through binning, whereas Geneformer uses a rank-based approach, discarding absolute expression to focus on the relative order of genes. UCE uniquely integrates protein sequence information by initializing its token embeddings using ESM-2 protein language model embeddings, providing a direct link to proteomic data [10].
A critical application of scFMs is predicting the transcriptomic changes following genetic or chemical perturbations. Recent rigorous benchmarks, however, reveal a significant performance gap between promise and practice.
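A landmark study published in Nature Methods directly compared five foundation models, including scGPT and Geneformer, against deliberately simple linear baselines for predicting transcriptome changes after single or double genetic perturbations [11]. The simple baselines included an additive model, which predicts the sum of the individual logarithmic fold changes, and a no-change model, which always predicts control-condition expression.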
Strikingly, none of the deep learning models outperformed the simple additive baseline in predicting double perturbation effects [11]. Furthermore, the models showed a poor ability to predict genetic interactions (e.g., synergistic or buffering effects), with none performing better than the no-change baseline [11].
The ability to predict the effect of a completely novel perturbation is a key claim for foundation models. In this task, a simple mean predictor (which always predicts the average expression profile across the training set) and a Random Forest model using Gene Ontology (GO) features consistently matched or outperformed sophisticated fine-tuned foundation models like scGPT and scFoundation across multiple Perturb-seq datasets [81]. In some cases, using the gene embeddings from scGPT as features for a simple Random Forest regressor yielded better performance than scGPT's own full fine-tuned pipeline, suggesting that the pretrained embeddings contain valuable biological information that the models' complex decoders struggle to leverage effectively for this task [11] [81].
Table 2: Summary of Benchmarking Results in Perturbation Prediction
| Task | Top Performing Model(s) | Underperforming Model(s) | Key Metric |
|---|---|---|---|
| Double Perturbation Effect Prediction | Additive Baseline Model [11] | scGPT, Geneformer, UCE, scBERT, GEARS, CPA [11] | L2 distance on top genes |
| Genetic Interaction Prediction | No-change Baseline Model [11] | scGPT, Geneformer, UCE, scBERT, GEARS, CPA [11] | True-Positive Rate vs. False Discovery Proportion |
| Unseen Single Perturbation Prediction | Random Forest with GO features; Mean Predictor [81] | scGPT, scFoundation (fine-tuned) [81] | Pearson Delta (Δ) Correlation |
| Zero-shot Cell Type Clustering | HVG Selection, scVI, Harmony [90] | scGPT, Geneformer (zero-shot) [90] | Average BIO Score |
Despite benchmarking challenges, these models are powerful tools when applied correctly. The following protocols outline standard workflows for two common scenarios.
This protocol is adapted from an end-to-end workflow for achieving high-accuracy retinal cell type annotation [91].
[Figure: scGPT Fine-tuning and Inference Workflow]
Key Steps:
This protocol outlines the process for adapting Geneformer to predict donor metadata, such as age group, from single-cell transcriptomes [87].
Key Steps:
Successful application of scFMs relies on a foundation of data, software, and computational resources.
Table 3: Essential Research Reagents and Resources
| Item Name | Function / Application | Example / Specification |
|---|---|---|
| Annotated H5AD File | Standard data container for single-cell data; primary model input. | Contains .X (expression matrix), .obs (cell metadata), and .var (gene metadata) [88]. |
| Highly Variable Genes (HVGs) | Reduces dimensionality and computational cost; focuses model on most informative genes. | Typically the top 1,000-2,000 HVGs are used as input for models like scGPT [88]. |
| Pre-trained Model Weights | Provides the foundation of biological knowledge; starting point for fine-tuning. | Downloaded from official sources (e.g., Hugging Face for Geneformer, Google Drive for scGPT) [88] [87]. |
| GPU Computing Resource | Accelerates model fine-tuning and inference, reducing time from hours to minutes. | Tested on setups like an NVIDIA A100, T4, or consumer-grade hardware with sufficient VRAM (>=32GB system RAM recommended) [88] [89]. |
| Gene Ontology (GO) Annotations | Provides prior biological knowledge; can be used as features in simple, high-performing baseline models [81]. | Used as input for Random Forest or linear models in benchmarking studies. |
| Perturbation Datasets | Gold-standard data for training and benchmarking in silico perturbation models. | Includes Norman et al. (CRISPRa), Adamson et al., and Replogle et al. (CRISPRi) datasets [11] [81]. |
The current landscape of single-cell foundation models presents a paradox of immense potential tempered by rigorous benchmarking. Models like Geneformer and scGPT have demonstrated strong performance in specific tasks like cell type annotation and metadata prediction when properly fine-tuned [87] [91]. However, for the pivotal task of in silico perturbation prediction, they have not yet consistently surpassed simple, biologically-informed baselines [11] [81]. This suggests that while their pretrained embeddings capture valuable biological information, their complex architectures may not be optimally decoding this information for predictive causal tasks.
For researchers and drug development professionals, the following strategic recommendations are proposed:
The ongoing development of scFMs is a rapidly evolving frontier. Future model generations, trained on even larger and more diverse datasets and potentially incorporating more causally-aware architectures, are poised to more fully deliver on the promise of accurate in silico perturbation modeling.
Single-cell foundation models (scFMs), such as scGPT and Geneformer, represent a paradigm shift in computational biology, trained on millions of cells from atlases like the Human Cell Atlas to learn universal representations of cellular states [18]. These models are increasingly employed for in silico perturbation (ISP) prediction, aiming to simulate cellular responses to genetic or chemical interventions without costly experiments [9] [4]. However, a fundamental challenge arises from the data fidelity gap—the discrepancy between the large-scale, observational "atlas" data used for pretraining and the specific, high-fidelity data generated in controlled perturbation experiments. This application note examines the technical basis of this gap, presents quantitative evidence of its impact on model performance, and provides detailed protocols to bridge it, thereby enhancing the predictive accuracy of scFMs in therapeutic discovery contexts.
The performance disparity between models using only atlas data versus those incorporating targeted perturbation data is substantial and measurable. Systematic benchmarking reveals that while scFMs are powerful, their open-loop ISP predictions can suffer from low positive predictive value.
Table 1: Performance Comparison of Open-Loop vs. Closed-Loop ISP for T-cell Activation Prediction
| Model Type | Positive Predictive Value (PPV) | Negative Predictive Value (NPV) | Sensitivity | Specificity | AUROC |
|---|---|---|---|---|---|
| Open-Loop ISP (Geneformer) | 3% | 98% | 48% | 60% | 0.63 |
| Differential Expression (DE) | 3% | 78% | 40% | 50% | Not reported |
| DE & Open-Loop ISP Overlap | 7% | Not reported | Not reported | Not reported | Not reported |
| Closed-Loop ISP (with perturbation data) | 9% | 99% | 76% | 81% | 0.86 |
Data derived from evaluation of Geneformer-30M-12L fine-tuned on T-cell activation status [9].
As shown in Table 1, the integration of even a limited number of perturbation examples during fine-tuning can produce a three-fold increase in PPV while simultaneously improving sensitivity and specificity [9]. This demonstrates that the data fidelity gap is not merely theoretical but has concrete, quantifiable effects on a model's ability to identify true positive targets.
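The metrics in the table follow directly from a confusion matrix. As a sketch, the snippet below computes them from counts that are purely illustrative: they were chosen so the resulting rates land near the closed-loop row, and they are not the counts reported in [9].

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a binary target-calling task."""
    return {
        "ppv": tp / (tp + fp),          # fraction of predicted hits that are real
        "npv": tn / (tn + fn),          # fraction of predicted non-hits that are real
        "sensitivity": tp / (tp + fn),  # fraction of real hits recovered
        "specificity": tn / (tn + fp),  # fraction of real non-hits rejected
    }

# Hypothetical screen: 1,000 candidate genes, of which 25 are true hits.
illustrative = classification_metrics(tp=19, fp=185, tn=790, fn=6)
```

With only 2.5% of candidates being true hits in this hypothetical screen, even 76% sensitivity and 81% specificity yield a PPV of about 9%. This illustrates how PPV can remain low in absolute terms despite a three-fold improvement when true targets are rare.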
The data fidelity gap arises from several interconnected technical and biological factors:
This protocol describes an iterative fine-tuning process that incorporates experimental perturbation data to "close the loop" and enhance a model's predictive accuracy for a specific biological context, such as a disease model [9].
Workflow Overview:
Step-by-Step Procedure:
Initial Context Fine-Tuning
Open-Loop In Silico Perturbation
Target Prioritization
Experimental Validation
Closed-Loop Fine-Tuning
For researchers without immediate access to wet-lab capabilities, leveraging existing large-scale perturbation models and datasets provides an alternative strategy to mitigate the fidelity gap.
Workflow Overview:
Step-by-Step Procedure:
Model Selection and Access
Querying the Model for Target Discovery
Mechanism of Action (MoA) Analysis
Table 2: Essential Computational and Experimental Reagents for Bridging the Fidelity Gap
| Category | Reagent / Tool | Function / Description | Key Application in Protocol |
|---|---|---|---|
| Foundation Models | scGPT [18] | A generative pretrained transformer model for single-cell multi-omics analysis. Pretrained on >33 million cells. | Protocol 1: Base model for context fine-tuning and open-loop ISP. |
| Foundation Models | Geneformer [9] [4] | A transformer model pretrained on a large corpus of transcriptomic data to learn a foundational representation of network dynamics. | Protocol 1: Used for closed-loop fine-tuning in T-cell and RUNX1-FPD case studies. |
| Foundation Models | Large Perturbation Model (LPM) [4] | A model integrating heterogeneous perturbation data by disentangling Perturbation, Readout, and Context (PRC) dimensions. | Protocol 2: Primary model for predicting outcomes of unobserved perturbations and MoA analysis. |
| Data Platforms | DISCO / CZ CELLxGENE [18] | Platforms aggregating over 100 million cells for federated analysis and data retrieval. | Protocol 1: Source of initial scRNA-seq data for context fine-tuning. |
| Data Platforms | LINCS Database [4] | A repository containing perturbation responses for thousands of genetic and chemical perturbagens across many cell lines. | Protocol 2: Key data source for training and querying LPMs. |
| Experimental Tools | Perturb-seq [9] | A high-throughput method combining CRISPR-based perturbations with single-cell RNA sequencing to read out molecular phenotypes. | Protocol 1: Critical for generating high-fidelity validation data for closed-loop learning. |
| Experimental Tools | CRISPRi/a Screens [9] | CRISPR interference or activation screens to repress or activate target genes, often coupled with functional readouts (e.g., flow cytometry). | Protocol 1: Provides orthogonal, functional validation of ISP predictions. |
The data fidelity gap is a critical, yet addressable, challenge in applying single-cell foundation models to perturbation modeling. Quantitative benchmarks show that moving from an open-loop to a closed-loop framework can substantially improve predictive accuracy, including a roughly three-fold increase in PPV alongside gains in sensitivity and specificity [9]. The two protocols provided here, a wet-lab-in-the-loop fine-tuning process and a computation-focused approach using existing LPMs, offer actionable roadmaps for researchers. By systematically integrating high-fidelity perturbation data, scFMs can evolve from powerful pattern recognition engines into reliable, predictive virtual cells capable of accelerating therapeutic discovery.
In silico perturbation modeling with single-cell foundation models represents a paradigm shift with immense potential for accelerating therapeutic discovery and understanding cellular mechanisms, particularly for rare diseases where patient samples are scarce. The development of 'closed-loop' frameworks demonstrates a promising path forward, showing that iterative incorporation of experimental data can significantly boost predictive accuracy. However, recent comprehensive benchmarks present a sobering counterpoint, revealing that current scFMs often fail to outperform deliberately simple baselines on perturbation prediction tasks. This underscores that the field is still in its nascent stages, with significant challenges remaining in model generalization, interpretability, and computational efficiency. The future success of this field will depend on developing more specialized architectures, curating higher-quality and more diverse perturbation datasets, establishing rigorous and standardized benchmarking practices, and fostering a tighter integration between computational prediction and experimental validation. Ultimately, overcoming these hurdles will be crucial for transforming the promise of 'virtual cells' into a reliable tool for biomedical discovery and clinical translation.