Cross-species cell annotation foundation models represent a transformative advance in single-cell biology, enabling the deciphering of universal gene regulatory mechanisms across evolution. This article explores how AI models like GeneCompass, TranscriptFormer, and CAME leverage vast datasets from multiple species to accurately identify cell types, predict disease states, and simulate cellular behavior. We examine the foundational principles, methodological architectures, optimization strategies, and validation frameworks that underpin these tools. For researchers and drug development professionals, this synthesis provides critical insights for applying these models to translate findings from model organisms to human biology, accelerating the discovery of disease mechanisms and therapeutic targets.
Cross-species cell annotation represents a computational frontier in evolutionary biology and translational research, enabling the transfer of cellular knowledge from model organisms to humans. The advent of single-cell RNA sequencing (scRNA-seq) has generated massive cellular atlases across diverse species, creating an unprecedented opportunity to decipher conserved and divergent cellular programs [1] [2]. Foundation models (FMs), pre-trained on millions of cells through self-supervised learning, have emerged as powerful tools to address the fundamental challenge of cross-species annotation: reconciling genomic differences to identify homologous cell types across evolutionary distances [3] [4] [5]. These models transform single-cell transcriptomics by treating cells as "sentences" and genes as "words," learning deep biological representations that transcend species boundaries through sophisticated architectural innovations [4]. This protocol examines the defining architectures, performance benchmarks, and practical implementation of cross-species cell annotation foundation models, providing researchers with a framework for leveraging these transformative tools in evolutionary biology and drug development.
Comprehensive evaluation of cross-species annotation models reveals distinct performance advantages across different biological contexts and evolutionary distances. The table below synthesizes key quantitative findings from major benchmarking studies.
Table 1: Performance Benchmarking of Cross-Species Cell Annotation Methods
| Model | Core Approach | Test Scenarios | Key Performance Metrics | Comparative Advantage |
|---|---|---|---|---|
| CAME [1] | Heterogeneous graph neural network | 54 scRNA-seq datasets across 7 species; 649 cross-species dataset pairs | Significant improvement in cell-type assignment across distant species; 6.26% average accuracy drop when excluding non-one-to-one homologies | Utilizes non-one-to-one homologous gene mapping; robust to sequencing depth inconsistencies |
| SATURN [2] | Protein language model (ESM-2) with macrogene space | Mammalian cell atlas (335k cells); frog-zebrafish embryogenesis | Effective annotation transfer across evolutionarily remote species; identification of misannotated cell populations | Discovers functionally related gene groups; enables cross-species differential expression analysis |
| Icebear [6] | Neural network decomposition of cell identity, species, and batch factors | Mouse-chicken-opossum brain/heart sci-RNA-seq3 | Accurate cross-species prediction of single-cell profiles; reveals X-chromosome upregulation evolutionary patterns | Enables single-cell resolution comparison without cell type matching; predicts missing biological contexts |
| Nicheformer [3] | Transformer trained on dissociated and spatial data (110M cells) | Spatial composition prediction; spatial label prediction | Outperforms Geneformer, scGPT, UCE, and CellPLM on spatial tasks | Integrates spatial context; transfers spatial information to dissociated data |
| scFMs (General) [5] | Various transformer architectures pretrained on large single-cell corpora | 5 datasets with diverse biological conditions; 7 cancer types | Robust and versatile across tasks; no single model dominates all scenarios | Captures biological insights into relational structure of genes and cells |
Recent benchmarking studies demonstrate that foundation models exhibit particular strengths in capturing biological relationships. A comprehensive evaluation of six scFMs against traditional baselines revealed that while these models are robust and versatile tools for diverse applications, simpler machine learning models can be more efficient for specific datasets under resource constraints [5]. Notably, no single scFM consistently outperforms others across all tasks, emphasizing the importance of task-specific model selection.
Table 2: Performance Across Cell-Level Tasks in Realistic Conditions
| Task Category | Top Performing Models | Key Findings | Considerations for Cross-Species Application |
|---|---|---|---|
| Cell Type Annotation [5] | scGPT, Geneformer, UCE | >80-90% accuracy for major cell types; struggles with rare cell types | Model performance correlates with cell-property landscape roughness in latent space |
| Cross-Species Transfer [1] [2] | CAME, SATURN | Effective even for non-model species and evolutionarily remote pairs | Dependency on quality of homologous gene mapping or protein embeddings |
| Spatial Context Prediction [3] | Nicheformer | Systematically outperforms models trained only on dissociated data | Requires spatial transcriptomics data for training; enables tissue niche prediction |
| Clinical Prediction [5] | scGPT, scFoundation | Accurate cancer cell identification and drug sensitivity prediction in zero-shot settings | Potential for translating findings from model organisms to human clinical contexts |
The CAME framework employs a heterogeneous graph neural network architecture that explicitly incorporates both one-to-one and non-one-to-one homologous gene mappings, which is particularly crucial for distant species comparisons where up to 60-75% of highly informative genes may not have one-to-one homologs [1].
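To make the role of non-one-to-one homologies concrete, the following is a minimal sketch of how a cell-gene heterogeneous graph could be assembled. It is not CAME's implementation: the function name `build_heterogeneous_graph`, the data layout, and the toy gene pairs are assumptions for illustration, and the homology pairs are not taken from any database.

```python
from collections import defaultdict

def build_heterogeneous_graph(expr_by_species, homology_pairs):
    """Build a toy cell-gene heterogeneous graph (illustrative only).

    expr_by_species: {species: {cell_id: {gene: count}}}
    homology_pairs: iterable of (gene_a, gene_b) node ids; a gene may appear
        in several pairs, so one-to-many and many-to-many links are kept.
    Returns an adjacency dict keyed by (node_type, node_id).
    """
    adjacency = defaultdict(set)
    # Cell-gene edges: a cell is linked to every gene it expresses.
    for species, cells in expr_by_species.items():
        for cell_id, counts in cells.items():
            for gene, count in counts.items():
                if count > 0:
                    cell_node = ("cell", f"{species}:{cell_id}")
                    gene_node = ("gene", f"{species}:{gene}")
                    adjacency[cell_node].add(gene_node)
                    adjacency[gene_node].add(cell_node)
    # Gene-gene edges: every homology pair is kept, not only one-to-one ones.
    for gene_a, gene_b in homology_pairs:
        adjacency[("gene", gene_a)].add(("gene", gene_b))
        adjacency[("gene", gene_b)].add(("gene", gene_a))
    return adjacency

# Hypothetical toy input: two human genes mapped to one zebrafish gene.
expr = {
    "human": {"h1": {"HBA1": 5, "HBA2": 3, "CD3E": 0}},
    "zebrafish": {"z1": {"hbaa1": 7}},
}
homologs = [("human:HBA1", "zebrafish:hbaa1"), ("human:HBA2", "zebrafish:hbaa1")]
graph = build_heterogeneous_graph(expr, homologs)
```

A pipeline restricted to one-to-one homologs would have to drop one of the two human genes here; keeping both edges lets message passing on the graph use the full set of informative genes.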
Experimental Protocol:
Input Processing:
Graph Construction:
Model Architecture:
Training Protocol:
Output Interpretation:
SATURN introduces a novel approach that couples gene expression with protein embeddings from large language models (e.g., ESM-2) to create universal cell embeddings that transcend genomic differences between species [2].
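The idea of a shared macrogene space can be illustrated with a small sketch. The code below assumes that macrogene centroids already exist in protein-embedding space and that gene-to-macrogene weights come from a softmax over negative squared distances; both are simplifying assumptions made here for illustration, not SATURN's actual weighting scheme, and all names and vectors are hypothetical.

```python
import math

def soft_assign(embedding, centroids, temperature=1.0):
    """Softmax over negative squared distances to each macrogene centroid."""
    dists = [sum((e - c) ** 2 for e, c in zip(embedding, centroid))
             for centroid in centroids]
    logits = [-d / temperature for d in dists]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [x / total for x in exps]

def macrogene_expression(cell_counts, gene_embeddings, centroids):
    """Aggregate one cell's gene counts into macrogene activations."""
    activations = [0.0] * len(centroids)
    for gene, count in cell_counts.items():
        weights = soft_assign(gene_embeddings[gene], centroids)
        for k, w in enumerate(weights):
            activations[k] += w * count
    return activations

# Hypothetical 2-D protein embeddings; real embeddings are much larger.
centroids = [[0.0, 0.0], [10.0, 10.0]]
gene_embeddings = {
    "human_geneA": [0.1, 0.0],
    "frog_geneA": [0.0, 0.2],
    "human_geneB": [9.8, 10.1],
}
human_macro = macrogene_expression({"human_geneA": 4, "human_geneB": 1},
                                   gene_embeddings, centroids)
frog_macro = macrogene_expression({"frog_geneA": 4}, gene_embeddings, centroids)
```

Because genes from different species with similar protein embeddings load onto the same macrogene, the human and frog cells become comparable in one feature space without an explicit ortholog table.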
Experimental Protocol:
Input Preparation:
Macrogene Space Construction:
Model Pretraining:
Weakly Supervised Training:
Cross-Species Differential Expression:
Nicheformer represents a paradigm shift by incorporating both dissociated single-cell and spatially resolved transcriptomics data during pretraining, enabling the transfer of spatial context across species [3].
Experimental Protocol:
Data Curation:
Tokenization Strategy:
Model Architecture & Training:
Spatial Downstream Tasks:
Table 3: Essential Research Reagents and Computational Resources
| Resource Category | Specific Solutions | Function in Cross-Species Annotation | Key Considerations |
|---|---|---|---|
| Data Repositories | CZ CELLxGENE [4], Human Cell Atlas [4], Tabula Sapiens [7] [2], Tabula Muris [2] | Provide standardized, annotated single-cell datasets for multiple species | Data quality varies; requires careful selection and filtering for pretraining |
| Protein Language Models | ESM-2 [2] | Generate protein embeddings that capture functional similarity beyond sequence homology | Enable remote homology detection; computationally intensive |
| Spatial Transcriptomics Technologies | MERFISH, Xenium, CosMx, ISS [3] | Provide spatial context for model training; enable spatial annotation transfer | Targeted gene panels limit gene coverage; technology-specific biases exist |
| Homology Databases | Orthologous gene mappings [1] [3] | Define evolutionary relationships between genes across species | Non-one-to-one homologies are crucial for distant species comparisons |
| Benchmarking Datasets | Asian Immune Diversity Atlas (AIDA) v2 [8] [5] | Provide independent validation across diverse populations and cell types | Essential for evaluating model generalization and avoiding data leakage |
| Computational Infrastructure | GPU clusters, High-performance computing [8] [5] | Enable model pretraining on millions of cells | Significant resources required; barrier to entry for some research groups |
Successful implementation of cross-species annotation models requires careful consideration of several practical factors. Model selection should be guided by specific research questions, as no single foundation model consistently outperforms others across all tasks [5]. For well-established cell types in closely related species, traditional methods may offer computational efficiency, while for novel cell types or distant species comparisons, foundation models with protein language model integration or spatial awareness provide distinct advantages.
Data quality and preprocessing significantly impact model performance. Careful batch effect correction, quality control filtering, and normalization are essential, particularly when integrating datasets across different technologies and species [4]. For cross-species applications, the handling of homologous relationships is critical—methods that incorporate non-one-to-one homologies or protein embeddings generally outperform those restricted to one-to-one gene matches, especially for evolutionarily distant species [1] [2].
Rigorous validation is essential for cross-species annotations. Biological validation should include examination of conserved marker gene expression, assessment of functional enrichment in predicted cell types, and comparison with orthogonal data modalities when available [2]. Computational validation metrics should extend beyond simple accuracy to include ontological similarity measures that capture hierarchical relationships between cell types [5].
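As a rough illustration of an ontology-aware score, the sketch below measures how far apart a predicted and a true label sit in a small cell-type tree. The scoring rule (one over one plus the number of edges between the two terms) and the toy hierarchy are assumptions chosen for simplicity; this is not the specific metric used in [5].

```python
def ancestors(term, parent_of):
    """Return the path from a term up to the root, the term itself first."""
    path = [term]
    while term in parent_of:
        term = parent_of[term]
        path.append(term)
    return path

def ontology_similarity(predicted, truth, parent_of):
    """1 / (1 + number of edges between the two terms in the tree)."""
    pred_path = ancestors(predicted, parent_of)
    true_path = ancestors(truth, parent_of)
    true_depth = {term: i for i, term in enumerate(true_path)}
    for steps_up, term in enumerate(pred_path):
        if term in true_depth:  # first shared ancestor is the lowest one
            return 1.0 / (1 + steps_up + true_depth[term])
    return 0.0  # the terms share no ancestor

# Hypothetical child -> parent map standing in for a real cell ontology.
parent_of = {
    "CD4 T cell": "T cell",
    "CD8 T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
}
sibling_score = ontology_similarity("CD4 T cell", "CD8 T cell", parent_of)
distant_score = ontology_similarity("CD4 T cell", "B cell", parent_of)
```

Under this rule, confusing two T cell subtypes is penalized less than confusing a T cell with a B cell, which plain accuracy would treat as equally wrong.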
Interpretation of model outputs requires special consideration in cross-species contexts. Low-confidence predictions may indicate species-specific cell states rather than annotation failures. Attention mechanisms and feature importance analyses can reveal the gene programs driving cross-species alignments, providing biological insights beyond simple cell type transfer [1] [2].
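One simple way to surface low-confidence predictions for manual review is to score each cell by the entropy of its predicted class probabilities. The sketch below assumes the model exposes per-class probabilities; the threshold of 0.8 and the function names are arbitrary illustrative choices, not recommendations from the cited studies.

```python
import math

def normalized_entropy(probabilities):
    """Shannon entropy of a prediction, scaled to the range [0, 1]."""
    n = len(probabilities)
    if n < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0)
    return entropy / math.log(n)

def flag_uncertain_cells(predictions, threshold=0.8):
    """Return ids of cells whose prediction entropy exceeds the threshold."""
    return [cell_id for cell_id, probs in predictions.items()
            if normalized_entropy(probs) > threshold]

# Hypothetical class probabilities for two query cells.
predictions = {
    "cell_1": [0.96, 0.02, 0.02],  # confident call
    "cell_2": [0.34, 0.33, 0.33],  # near-uniform, candidate species-specific state
}
flagged = flag_uncertain_cells(predictions)
```

Flagged cells are candidates for closer inspection, for example by checking marker genes, rather than being forced into the nearest reference label.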
The field of cross-species cell annotation foundation models is rapidly evolving, with several promising research directions emerging. Multimodal integration that combines transcriptomic, epigenetic, proteomic, and spatial information will likely enhance annotation accuracy and biological relevance [4]. Few-shot and zero-shot learning approaches are being developed to handle rare cell types and poorly characterized species [5]. Additionally, methods that explicitly model evolutionary distances and phylogenetic relationships may improve annotation transfer across broader taxonomic ranges.
As these models mature, development of more sophisticated benchmarking frameworks and standardized evaluation metrics will be crucial for advancing the field. Community efforts to create comprehensive cross-species benchmark datasets and establish best practices for model reporting will enable more systematic comparisons and accelerate progress in this transformative area of computational biology.
The development of robust cross-species foundation models requires a clear quantitative understanding of the biological and technical variabilities involved. The table below summarizes key metrics and benchmarks essential for designing and evaluating such models.
Table 1: Key Quantitative Benchmarks in Cross-Species Single-Cell Analysis
| Metric / Component | Description / Value | Biological Significance / Impact |
|---|---|---|
| Evolutionary Distance (Data) | Training on 12 species spanning 1.5 billion years of evolution [9] | Enables model generalization across vast evolutionary scales and OOD prediction. |
| Data Scale (Model Training) | Models pretrained on >100 million cells from diverse public archives [10] | Provides the foundational "knowledge base" for the model's understanding of cellular biology. |
| CCD Boundary Conservation (Human vs. Chimpanzee) | 71.2% of human CCD boundaries are shared with chimpanzees [11] | Provides a quantitative measure of 3D genome architecture conservation between close species. |
| Minimum Contrast Ratio (WCAG AA - Large Text) | At least 3:1 [12] [13] | Ensures accessibility and legibility for data visualization interfaces and published findings. |
| Enhanced Contrast Ratio (WCAG AAA - Body Text) | At least 7:1 [14] | A higher standard for legibility in critical displays and publications. |
Objective: To compile a large-scale, diverse, and high-quality single-cell dataset for pretraining a foundational model capable of cross-species generalization.
Materials:
Methodology:
Objective: To train a transformer-based model on the curated corpus using self-supervised learning, enabling it to learn fundamental principles of gene expression.
Materials:
Methodology:
Select the top `k` highly variable genes or all genes above an expression threshold.
Prepend a `[CELL]` token to aggregate cell-level context and append modality tokens (`[RNA]`, `[ATAC]`) for multi-omics models [10].
Objective: To evaluate the model's ability to accurately transfer cell type labels from a well-annotated reference species to a target species, including out-of-distribution (OOD) species.
Materials:
Methodology:
Table 2: Essential Resources for Cross-Species Foundation Model Research
| Resource / Reagent | Type | Primary Function | Example / Source |
|---|---|---|---|
| Curated Data Platforms | Data Repository | Provides unified access to standardized, annotated single-cell datasets for model training. | CZ CELLxGENE [10], Tabula Sapiens [9] |
| Multi-Omics Data | Data Type | Enables training of models that can integrate gene expression with chromatin accessibility for a more comprehensive view. | scATAC-seq, Multiome Sequencing [10] |
| Pretrained Foundation Models | Software Model | Provides a starting point for transfer learning, saving computational resources and time. | TranscriptFormer [9], scGPT [10], scBERT [10] |
| Accessibility Evaluation Tools | Software Tool | Ensures that data visualization dashboards and UIs meet contrast standards for inclusive science. | axe DevTools [15], WebAIM Color Checker [12] |
Cross-species cell annotation foundation models represent a paradigm shift in biomedical research, enabling the transfer of biological knowledge across evolutionary distances. By leveraging large-scale single-cell transcriptomic data from multiple species, these models decipher conserved and species-specific cellular principles, accelerating discoveries from basic evolution to translational medicine [10] [9]. The following applications highlight their transformative potential.
Objective: To identify evolutionarily conserved gene expression programs and cell-type relationships across species, providing insights into fundamental cellular mechanisms preserved over billions of years of evolution.
Background: A primary challenge in evolutionary biology is distinguishing conserved core biological processes from species-specific adaptations. Single-cell foundation models (scFMs), trained on diverse multi-species datasets, learn latent representations that encapsulate both universal cellular states and lineage-specific differences [9]. For instance, TranscriptFormer was pretrained on 112 million cells from 12 species, covering 1.5 billion years of evolutionary divergence, creating a model that intrinsically understands cellular homology and variation [9].
Key Findings:
Quantitative Performance: The following table summarizes the cross-species cell type classification performance of a foundational model (TranscriptFormer) compared to baseline methods.
Table 1: Cross-Species Cell Type Classification Accuracy (%)
| Model / Species | Rhesus Macaque | Marmoset | Mouse | Zebrafish |
|---|---|---|---|---|
| TranscriptFormer | 92.5 | 89.7 | 85.1 | 78.3 |
| Baseline Model A | 88.1 | 84.3 | 79.5 | 70.2 |
| Baseline Model B | 90.2 | 86.5 | 81.8 | 72.9 |
Note: Accuracy reflects the model's ability to correctly annotate cell types in species not seen during training (out-of-distribution species). Results are aggregated from benchmark tasks detailed in the TranscriptFormer preprint [9].
Objective: To leverage cross-species models to understand human disease pathophysiology, predict disease states from cellular transcriptomes, and improve the translational relevance of animal models.
Background: A significant obstacle in drug development is the failure of findings from animal models to translate to human patients. scFMs can identify conserved disease-associated gene networks and predict cellular responses to perturbation, thereby providing a more reliable bridge between model organisms and human biology [10] [9].
Key Findings:
Quantitative Performance: The table below compares the performance of foundational models in predicting disease states against baseline models.
Table 2: Disease State Prediction Performance (F1 Score)
| Model / Disease Task | SARS-CoV-2 Infection | Ageing Brain Classification | Cancer Cell Identification |
|---|---|---|---|
| TranscriptFormer | 0.94 | N/A | N/A |
| scBERT | N/A | 0.89 | N/A |
| scGPT | N/A | N/A | 0.91 |
| Baseline Model A | 0.87 | 0.82 | 0.85 |
| Baseline Model B | 0.90 | 0.85 | 0.88 |
Note: F1 score (0-1) is the harmonic mean of precision and recall. Higher scores indicate better performance. scBERT and scGPT are other prominent single-cell foundation models [10]. Ageing brain classification performance is derived from benchmarking on human prefrontal cortex snRNA-seq data [17].
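For reference, the F1 score mentioned in the note can be written out explicitly. Here TP, FP, and FN denote true positives, false positives, and false negatives for the class of interest; the numbers in the worked example are illustrative and are not taken from the table.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R}

% Illustrative example: P = 0.90, R = 0.80
F_1 = \frac{2 \times 0.90 \times 0.80}{0.90 + 0.80} = \frac{1.44}{1.70} \approx 0.847
```

Because it is a harmonic mean, F1 sits closer to the lower of precision and recall, so a model cannot score well by maximizing only one of them.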
Purpose: To annotate cell types in a target species (e.g., Marmoset) using a model trained on a source species (e.g., Human).
Principle: Foundation models like TranscriptFormer learn a shared latent space where analogous cell types from different species are positioned proximately based on conserved gene expression patterns, enabling knowledge transfer without explicit labels in the target species [9].
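The principle can be sketched as nearest-neighbor label transfer in the shared embedding space. The code below assumes cell embeddings have already been produced by a pretrained model; the two-dimensional vectors, the cosine-similarity k-nearest-neighbor vote, and the function names are illustrative assumptions rather than TranscriptFormer's own classification head.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def transfer_labels(reference, query, k=3):
    """Assign each query cell the majority label of its k nearest references.

    reference: list of (embedding, label) from the annotated source species.
    query: {cell_id: embedding} from the target species.
    Returns {cell_id: (label, fraction_of_neighbors_agreeing)}.
    """
    results = {}
    for cell_id, emb in query.items():
        ranked = sorted(reference, key=lambda item: cosine(emb, item[0]),
                        reverse=True)
        neighbors = ranked[:k]
        votes = Counter(label for _, label in neighbors)
        label, count = votes.most_common(1)[0]
        results[cell_id] = (label, count / len(neighbors))
    return results

# Hypothetical embeddings: human reference cells and marmoset query cells.
reference = [
    ([1.0, 0.1], "neuron"), ([0.9, 0.2], "neuron"), ([0.95, 0.0], "neuron"),
    ([0.1, 1.0], "astrocyte"), ([0.0, 0.9], "astrocyte"), ([0.2, 1.0], "astrocyte"),
]
query = {"marmoset_cell_1": [0.8, 0.1], "marmoset_cell_2": [0.1, 0.7]}
annotations = transfer_labels(reference, query)
```

The agreement fraction returned alongside each label gives a crude confidence signal: values well below 1.0 suggest a query cell that sits between reference populations.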
Materials:
Procedure:
Purpose: To harmonize two or more single-cell datasets from different studies and species, removing technical batch effects while preserving meaningful biological differences.
Principle: The Cross-Species Normalization (CSN) method is designed to explicitly reduce technical variance between datasets while conserving interspecies biological variation. It is based on an evaluation criterion that maximizes the removal of experimental artifacts and minimizes the loss of biological signal [18].
Materials:
Procedure:
Table 3: Essential Research Reagents and Resources for Cross-Species Analysis
| Item | Function & Application | Example/Specification |
|---|---|---|
| CZ CELLxGENE | A curated data platform providing unified access to millions of annotated single-cells from diverse species and tissues. Used for model pre-training and validation [9]. | https://cellxgene.cziscience.com/ |
| Ensembl BioMart | A data mining tool to obtain lists of one-to-one orthologous genes between species (e.g., human and mouse). Critical for gene space alignment before cross-species analysis [18]. | http://www.ensembl.org/biomart/martview |
| TranscriptFormer Model | A generative, cross-species foundation model for single-cell transcriptomics. Used for out-of-distribution cell type annotation, disease prediction, and gene interaction modeling [9]. | Available via CZI's virtual cell platform. |
| Cross-Species Normalization (CSN) | A dedicated normalization algorithm for harmonizing datasets from different studies and species. Reduces technical effects while better preserving biological differences compared to EB, DWD, or XPN [18]. | R/Python implementation as described in [18]. |
| scPred Model | A classification model used to map and align cell types across species within a defined atlas, enabling the identification of conserved and variable cell types [16]. | R package 'scPred'. |
The analysis of biological systems has undergone a profound transformation, shifting from isolated single-species models to integrative multi-species frameworks. This evolution, driven by the recognition of complex ecological interactions and the advent of high-throughput single-cell genomics, is revolutionizing fields from conservation ecology to therapeutic development. This Application Note details the quantitative evidence supporting this paradigm shift, provides standardized protocols for implementing multi-species analysis, and visualizes the core workflows and reagent tools essential for researchers and drug development professionals engaged in cross-species investigation.
The traditional approach to modeling biological systems has long been dominated by single-species models. In ecology, these models focused on the population dynamics of a single species in isolation [19]. Similarly, in early single-cell genomics, cell type annotation was often performed by analyzing one dataset or one species at a time, relying on manual curation and limited marker genes [20] [21]. These methods, while useful for initial insights, fundamentally ignored the complex web of biological interactions and shared evolutionary patterns that define real-world biological systems. The intrinsic limitations of this single-species approach—including an inability to accurately forecast population changes in ecological communities and a lack of robustness when annotating cell types across diverse datasets or species—created a pressing need for more sophisticated frameworks [19] [22].
The shift to multi-species analysis frameworks represents a response to these limitations, enabled by advances in computational power and the accumulation of large-scale datasets. In ecology, this means jointly modeling interacting species to produce superior forecasts [22]. In single-cell biology, it has given rise to cross-species foundation models like TranscriptFormer, which are pretrained on millions of cells from multiple species to learn conserved biological principles [10] [9]. These models use the transformer architecture to interpret the "language" of cells across evolutionary distances, allowing for the prediction of cell types and disease states even in species not seen during training [10] [9]. This document details the experimental evidence, protocols, and tools that underpin this critical transition, providing a roadmap for researchers to implement multi-species analyses.
Empirical evidence consistently demonstrates the superior performance of multi-species frameworks over their single-species predecessors. The tables below summarize key quantitative comparisons from ecological and single-cell genomic studies.
Table 1: Comparative Performance of Ecological Forecasting Models
| Model Type | Key Features | Forecast Performance | Study Context |
|---|---|---|---|
| Single-Species Model | Models species in isolation; ignores biotic interactions [19]. | Lower accuracy in hindcast and forecast vs. multi-species models [22]. | Rodent population dynamics over 25 years [22]. |
| Multi-Species Dynamic Model | Jointly models species with shared environmental responses & temporal dependencies [22]. | Superior hindcast and forecast performance; captures nonlinear, lagged effects [22]. | Nine rodent species in a semi-arid community [22]. |
Table 2: Comparative Performance of Single-Cell Annotation Tools
| Method / Model | Underlying Principle | Key Advantages | Reference |
|---|---|---|---|
| Manual Annotation | Expert curation of marker genes for each cluster [21]. | Considered the "gold standard"; allows for deep biological insight [21]. | [21] |
| Automated Tools (e.g., PCLDA) | Simple statistical pipelines (PCA, LDA) for classification [23]. | High interpretability, computational efficiency, and stability across protocols [23]. | [23] |
| Foundation Models (e.g., TranscriptFormer, scGPT) | Transformer-based AI pretrained on massive, multi-species atlases [10] [9]. | Cross-species cell type prediction; identification of disease states; predicts gene-gene interactions [9]. | [10] [9] |
This protocol details the use of a pre-trained foundation model, such as TranscriptFormer, to annotate cell types in a query dataset from a species that was not necessarily part of the model's training data [9].
Input Data Preparation (Query Dataset)
Format the query data as an `anndata` object and save it as an H5AD file [24].
Provide the expression matrix in `adata.X`, `adata.raw.X`, or a specified layer. Perform initial quality control to filter out low-quality cells [24].
Model Selection and Setup
Install the annotation tooling with `pip install 'cellxgene[annotate]'` within a Python 3.9+ environment [24].
Execution of Annotation
Run `cellxgene annotate ./query_data.h5ad --model-url https://model-repository.org/model.zip --output-h5ad-file ./annotated_data.h5ad` [24].
Exploration and Validation
Open the annotated file with `cellxgene launch ./annotated_data.h5ad` [24].
Inspect the predicted label (`cxg_cell_type_predicted`) and the associated uncertainty score (`cxg_cell_type_predicted_uncertainty`) [24].
This protocol outlines the steps for constructing a dynamic generalized additive model (GAM) to forecast the population abundances of multiple interacting species, as validated in rodent communities [22].
Data Compilation and Preprocessing
Model Specification
Model Fitting and Inference
Forecast Generation and Evaluation
Successful implementation of multi-species frameworks relies on a suite of computational tools and curated data resources.
Table 3: Key Resources for Multi-Species Single-Cell Analysis
| Resource Name | Type | Function in Research | Relevant Citation |
|---|---|---|---|
| CZ CELLxGENE | Data Platform / Tool | Provides unified access to millions of annotated single-cells; hosts automated annotation pipeline [24]. | [24] [9] |
| ACT (Annotation of Cell Types) | Web Server | Provides a knowledge-based platform for cell type enrichment using a hierarchically organized marker map [21]. | [21] |
| TranscriptFormer | Foundation Model | A generative, multi-species model for predicting cell types, disease states, and gene interactions across evolution [9]. | [9] |
| scGPT / scBERT | Foundation Model | Transformer-based models for single-cell biology, pretrained on large corpora for various downstream tasks [10]. | [10] |
| PCLDA | Annotation Algorithm | A simple, interpretable, and robust pipeline for cell annotation based on PCA and LDA [23]. | [23] |
| Hierarchical Marker Map | Curated Knowledgebase | A collection of canonical markers and differentially expressed genes organized by tissue and cell type, used for enrichment testing [21]. | [21] |
The transition from single-species to multi-species analysis frameworks is a cornerstone of modern biology, enabling more accurate predictions and a deeper, more unified understanding of complex systems. In ecology, multi-species models are proving essential for reliable forecasting and informed conservation [22]. In single-cell genomics, foundation models like TranscriptFormer are breaking down species barriers, creating a powerful new paradigm for discovering conserved cellular functions and disease mechanisms [9]. The protocols, tools, and visualizations provided in this Application Note offer a practical foundation for researchers to integrate these advanced multi-species approaches into their work, ultimately accelerating progress in both fundamental biology and therapeutic development.
The transformer architecture, originally designed for natural language processing (NLP), has catalyzed a revolution in computational biology. Its core mechanism, self-attention, allows models to dynamically weigh the importance of different elements in a sequence, whether words in a sentence or genes in a cell [25]. This capability to capture complex, long-range dependencies makes it uniquely suited for biological data. In single-cell transcriptomics, this has led to the emergence of single-cell Foundation Models (scFMs)—large-scale deep learning models pretrained on vast datasets that can be adapted for diverse downstream tasks like cell type annotation and gene regulatory network inference [10] [4]. These models treat a cell's transcriptome as a "sentence" and individual genes as "words," thereby learning the fundamental language of cellular biology from millions of cells across diverse tissues and species [10].
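To show what "weighing the importance of different elements" means in practice, here is a minimal sketch of scaled dot-product self-attention over a handful of gene tokens. It is deliberately simplified: the query, key, and value projections are taken to be the identity (real models learn separate weight matrices and use multiple heads), and the two-dimensional token vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens: list of equal-length vectors, one per gene token.
    Returns (outputs, weights); weights[i][j] is how strongly token i
    attends to token j.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        # Each output is the attention-weighted average of all token vectors.
        outputs.append([sum(a * value[d] for a, value in zip(attn, tokens))
                        for d in range(dim)])
    return outputs, weights

# Hypothetical embeddings: the first two genes are similar, the third is not.
gene_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(gene_tokens)
```

In this toy case the first gene attends more to the second gene than to the third, which is the mechanism by which co-varying genes can inform each other's representations regardless of their position in the input.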
Transformer-based models are delivering state-of-the-art performance across a wide spectrum of biological applications. The table below summarizes the quantitative performance of several prominent models on key tasks.
Table 1: Performance of Transformer-Based Models in Biological Applications
| Model Name | Primary Application | Key Performance Metric | Result |
|---|---|---|---|
| scGREAT [26] | Gene Regulatory Network (GRN) Inference | Average AUROC (Cell-type-specific ChIP-seq) | 90.5% (range 81.4% - 95.0%) |
| scGREAT (vs. other methods) [26] | GRN Inference | Performance improvement in AUROC vs. GENELink, GNE, CNNC | +6.3%, +15.5%, +23.9% respectively |
| Delphi-2M [27] | Disease Trajectory Prediction | Average AUC (across disease spectrum) | ~0.76 |
| scTab [28] | Cross-tissue Cell Type Annotation | Scaling performance | Performance scales with model size & training data size |
| TranscriptFormer [9] | Cross-species Cell Type Annotation | Cell type classification in unseen species | State-of-the-art (outperforms comparable models) |
A premier application of scFMs is cross-species cell annotation. TranscriptFormer, a generative multi-species model trained on 112 million cells from 12 species, demonstrates a remarkable ability to identify cell types in species not included in its training data (e.g., rhesus macaque and marmoset) [9]. This capability to translate gene expression patterns across vast evolutionary distances is crucial for biomedical research, as it helps predict whether findings in model organisms are likely to translate to humans.
Inferring the complex regulatory interactions between transcription factors and their target genes is a fundamental challenge in biology. scGREAT leverages a transformer backbone to infer GRNs from single-cell transcriptomics data. Its superior performance, outperforming other contemporary methods on seven benchmark datasets, highlights the transformer's ability to capture the intricate, non-linear relationships within gene regulatory systems [26].
Beyond single-cell analysis, transformers are being adapted to model human health. Delphi-2M uses a modified GPT architecture to learn patterns of disease progression from population-scale health records [27]. It can predict future rates of over 1,000 diseases conditional on an individual's past medical history, providing meaningful estimates of potential disease burden for up to 20 years and enabling a new paradigm in personalized health risk assessment.
This protocol outlines the key steps for developing a transformer-based model for single-cell transcriptomics, synthesizing methodologies from models like scGPT, scBERT, and TranscriptFormer [10] [4].
1. Data Curation and Preprocessing
2. Tokenization and Input Representation
3. Model Architecture and Pretraining
4. Downstream Task Fine-Tuning
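The tokenization step listed above can be sketched as follows. This assumes one common strategy, ordering genes by decreasing expression and keeping the top few; other models instead bin expression values, and the vocabulary, special tokens, and function name here are hypothetical rather than taken from any specific model.

```python
def tokenize_cell(counts, vocab, max_genes=4, modality="[RNA]"):
    """Turn one cell's counts into a token sequence and matching ids.

    Genes are ordered by decreasing expression (ties broken by name),
    truncated to max_genes, then wrapped with a leading [CELL] token and
    a trailing modality token.
    """
    expressed = [(g, c) for g, c in counts.items() if c > 0 and g in vocab]
    ranked = sorted(expressed, key=lambda item: (-item[1], item[0]))
    genes = [g for g, _ in ranked[:max_genes]]
    tokens = ["[CELL]"] + genes + [modality]
    return tokens, [vocab[t] for t in tokens]

# Hypothetical vocabulary and cell; real vocabularies cover tens of
# thousands of genes.
vocab = {"[CELL]": 0, "[RNA]": 1, "[MASK]": 2,
         "GENE_A": 3, "GENE_B": 4, "GENE_C": 5, "GENE_D": 6}
cell = {"GENE_A": 2, "GENE_B": 9, "GENE_C": 0, "GENE_D": 5}
tokens, token_ids = tokenize_cell(cell, vocab, max_genes=2)
```

The resulting id sequence is what a transformer would consume; during masked pretraining, a subset of the gene positions would be replaced by the `[MASK]` id and the model trained to recover them.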
Diagram: Single-Cell Foundation Model Workflow
This protocol details the application of a pretrained model like TranscriptFormer for annotating cell types in a new, unlabeled dataset from a novel species [9].
1. Model Acquisition and Input Preparation
2. Generation of Cell Embeddings
3. Cell Type Prediction
4. Validation and Interpretation
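The embedding and prediction steps above can be illustrated with a reference-based label transfer. The sketch below is a minimal example, assuming cell embeddings have already been produced by a pretrained model; the function names `cosine` and `knn_annotate`, the toy 2-D vectors, and the choice of a k-nearest-neighbour vote are illustrative assumptions rather than the procedure of any specific model.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_annotate(query, ref_embeddings, ref_labels, k=3):
    """Return the majority label among the k most similar reference cells."""
    ranked = sorted(
        zip(ref_embeddings, ref_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k  # label plus the vote fraction as a crude confidence

# Toy 2-D embeddings standing in for model output (illustrative values only).
ref_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
ref_labels = ["T cell", "T cell", "hepatocyte", "hepatocyte"]
label, confidence = knn_annotate([0.95, 0.15], ref_embeddings, ref_labels, k=3)
```

The vote fraction gives a rough signal for flagging low-confidence cells that may deserve manual review during validation.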
Diagram: Cross-Species Annotation Process
The development and application of biological transformer models rely on a suite of computational "reagents" and resources.
Table 2: Key Research Reagents and Resources for scFM Development
| Resource Category | Specific Examples | Function and Utility |
|---|---|---|
| Data Repositories | CZ CELLxGENE [9], Human Cell Atlas [10], GEO/SRA [10], PanglaoDB [10] | Provide large-scale, curated single-cell datasets essential for pretraining foundational models. |
| Model Architectures | Transformer Encoder (BERT-style) [10], Transformer Decoder (GPT-style) [27] [10], Hybrid Architectures | Serve as the core computational engine for building attention-based models. |
| Pretraining Tasks | Masked Language Modeling [10] [4] | Enables self-supervised learning on unlabeled data, forcing the model to learn meaningful biological context. |
| Benchmarking Platforms | BEELINE [26] | Provides standardized datasets and evaluation frameworks to fairly compare model performance. |
| Ontologies | Cell Ontology (CL) [28] | Provides a structured, hierarchical vocabulary for cell types, critical for standardizing model outputs and evaluations. |
Transformer architectures have successfully bridged the gap between natural language and the language of biology, enabling the creation of powerful foundation models for single-cell transcriptomics. These models, trained on millions of cells across diverse species and conditions, are revolutionizing biological discovery. They excel at cross-species cell annotation, gene regulatory network inference, and disease trajectory prediction, providing scalable and accurate tools for researchers and drug developers. As data corpora continue to expand and model architectures are refined, these AI-powered virtual cells promise to deepen our understanding of cellular function and accelerate the development of new therapeutics.
The emergence of single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to study cellular heterogeneity at unprecedented resolution. However, a significant challenge remains in comparing and annotating cell types across different species, which is crucial for understanding evolutionary biology and translating findings from model organisms to humans. Cross-species cell annotation foundation models address this challenge by leveraging large-scale single-cell transcriptomic data from multiple organisms to learn universal representations of cellular states [10]. These models typically employ transformer-based architectures, originally developed for natural language processing, to decipher the "language" of gene regulation by treating individual cells as sentences and genes or genomic features as words [10] [5].
The fundamental paradigm involves pre-training on massive, diverse single-cell datasets through self-supervised learning objectives, enabling the models to capture core biological principles of gene regulation and cell identity that are conserved across evolutionary distances [29] [9]. This pre-training phase allows the models to develop a foundational understanding of cellular biology that can then be fine-tuned for specific downstream tasks such as cell type classification, disease state identification, and gene regulatory network inference [10] [5]. By integrating data from multiple species, these models can decipher universal gene regulatory mechanisms and facilitate knowledge transfer between organisms, accelerating the discovery of critical cell fate regulators and candidate drug targets [29] [30].
Table 1: Key specifications of cross-species cell annotation foundation models
| Model | Training Data Scale | Species Coverage | Architecture | Parameters | Key Innovation |
|---|---|---|---|---|---|
| GeneCompass | 101.7M cells after processing [29] | Human, Mouse [29] | 12-layer transformer [29] | Not specified | Integrates 4 types of prior biological knowledge [29] |
| TranscriptFormer | 112M cells [9] | 12 species across 1.5B years of evolution [9] | Transformer, 12 layers, 16 attention heads, trained autoregressively [31] | 368-542 million [31] | Expression-aware attention; ESM-2 protein embeddings [31] |
| Icebear | Not specified | Mouse, Opossum, Chicken [6] | Neural network framework [6] | Not specified | Decomposes single-cell measurements into cell identity, species, and batch factors [6] |
| CAME | Not specified | Not specified | Not specified | Not specified | Not specified |
The architectural implementations of these models reflect their specialized approaches to cross-species analysis. GeneCompass employs a knowledge-informed framework that incorporates four types of prior biological knowledge during pre-training: gene regulatory networks, promoter sequences, gene family annotations, and gene co-expression relationships [29]. It uses a masked language modeling strategy where 15% of gene inputs are randomly masked in each cell, with the model trained to recover both gene IDs and expression values simultaneously [29]. This approach enhances the model's ability to capture intricate gene relationships in a context-aware manner.
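The masking step described above can be sketched as follows. This is a minimal illustration, assuming each cell is a list of gene ids paired with expression values; `MASK_ID`, `MASK_VALUE`, and `mask_cell` are hypothetical names, and the actual GeneCompass implementation may handle masked positions differently.

```python
import random

MASK_ID = 0        # hypothetical id reserved for the mask token
MASK_VALUE = 0.0   # hypothetical placeholder for a hidden expression value

def mask_cell(gene_ids, values, mask_fraction=0.15, seed=None):
    """Hide a random subset of (gene id, value) pairs; return inputs and targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_fraction * len(gene_ids)))
    masked_positions = set(rng.sample(range(len(gene_ids)), n_mask))
    in_ids, in_values, targets = [], [], {}
    for pos, (gid, val) in enumerate(zip(gene_ids, values)):
        if pos in masked_positions:
            in_ids.append(MASK_ID)
            in_values.append(MASK_VALUE)
            targets[pos] = (gid, val)   # the model must recover both
        else:
            in_ids.append(gid)
            in_values.append(val)
    return in_ids, in_values, targets

gene_ids = list(range(1, 21))                 # 20 toy gene tokens
values = [float(i) for i in range(1, 21)]     # toy expression values
in_ids, in_values, targets = mask_cell(gene_ids, values, seed=0)
```

Keeping the original id and value together in `targets` mirrors the dual objective of recovering gene identity and expression level at each masked position.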
TranscriptFormer introduces an expression-aware attention mechanism in which expression counts enter the attention matrix as a log-count bias term, avoiding explicit token duplication [31]. It uses ESM-2 protein embeddings for gene representation and includes an assay token to capture sequencing platform metadata [31]. The model is trained autoregressively on both gene identities and their counts, and it randomly permutes the expressed genes in each batch to remove positional bias [31].
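To see why a log-count bias can replace token duplication, note that softmax(s_j + log c_j) equals c_j·exp(s_j) / Σ_k c_k·exp(s_k), which is the weight gene j would receive if its token were repeated c_j times. The sketch below illustrates this with plain Python; the function name `count_biased_attention` is hypothetical, causal masking and multiple heads are omitted, and the exact formulation in TranscriptFormer may differ.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def count_biased_attention(queries, keys, counts):
    """Attention weights where key j receives an additive log(count_j) bias.

    Assumes every count is positive, i.e. only expressed genes are in the sequence.
    """
    d = len(keys[0])
    bias = [math.log(c) for c in counts]
    weights = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) + b
            for k, b in zip(keys, bias)
        ]
        weights.append(softmax(scores))
    return weights

# With a zero query all content scores are equal, so only the counts matter.
keys = [[0.3, -0.1], [0.5, 0.2], [-0.4, 0.9]]
weights = count_biased_attention([[0.0, 0.0]], keys, counts=[1, 2, 5])
```

In this toy case the weights come out proportional to the counts (1:2:5), the same result as attending over a sequence in which each gene token appears once per count.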
Icebear employs a fundamentally different approach by designing a neural network framework that decomposes single-cell measurements into separable factors representing cell identity, species, and batch effects [6]. This factorization enables the model to perform cross-species prediction of gene expression profiles by swapping the species factor corresponding to each cell, facilitating direct comparison of expression profiles across species at single-cell resolution without relying on external cell type annotations [6].
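The factor-swapping idea can be shown with a deliberately simplified stand-in. The sketch below replaces the neural decoder with a toy additive model, which is an assumption made purely for illustration; the factor values are invented, and `decode` is a hypothetical name rather than part of the Icebear code.

```python
def decode(cell_factor, species_factor, batch_factor):
    """Toy additive decoder: predicted expression is the sum of the three factors."""
    return [c + s + b for c, s, b in zip(cell_factor, species_factor, batch_factor)]

# Hypothetical learned factors over three genes (illustrative numbers only).
cell_identity = [2.0, 0.5, 1.0]
species = {"mouse": [0.1, 0.0, -0.2], "chicken": [-0.3, 0.4, 0.0]}
batch = [0.0, 0.0, 0.0]

observed_mouse = decode(cell_identity, species["mouse"], batch)
# Swap only the species factor; the cell identity factor is held fixed.
predicted_chicken = decode(cell_identity, species["chicken"], batch)
```

Because the cell identity factor is unchanged, the difference between the two profiles is attributable to the species factor alone, which is what makes the per-cell cross-species comparison possible.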
Figure 1: Architectural overview of cross-species foundation models showing shared inputs and diverse application outputs
The training methodologies for these foundation models involve sophisticated pre-training strategies on massive single-cell datasets. GeneCompass was trained on over 120 million human and mouse single-cell transcriptomes (with 101.7 million cells retained after quality control) using a self-supervised learning approach [29] [30]. The model incorporates homologous gene mapping between species, with 17,465 homologous genes out of 36,092 total genes in its token dictionary [29]. For each cell, the top 2048 genes are selected to construct the context after normalizing and ranking gene expression values, then absolute gene expression values are concatenated with corresponding gene IDs for stronger supervision constraints [29].
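A shared token dictionary of the kind described above can be sketched as follows. This is a minimal illustration, assuming one-to-one homolog pairs; `build_token_dictionary` and the `*_ONLY_1` gene names are hypothetical placeholders, and the real dictionary construction may treat one-to-many homologs differently.

```python
def build_token_dictionary(human_genes, mouse_genes, homolog_pairs):
    """Assign one token id per homolog pair and one per species-specific gene."""
    token_of = {}
    next_id = 1  # 0 is left free for padding in this sketch
    for human_gene, mouse_gene in homolog_pairs:
        token_of[("human", human_gene)] = next_id
        token_of[("mouse", mouse_gene)] = next_id
        next_id += 1
    for species, genes in (("human", human_genes), ("mouse", mouse_genes)):
        for gene in genes:
            if (species, gene) not in token_of:
                token_of[(species, gene)] = next_id
                next_id += 1
    return token_of

human_genes = ["GATA4", "TBX5", "HUMAN_ONLY_1"]   # last name is a placeholder
mouse_genes = ["Gata4", "Tbx5", "MOUSE_ONLY_1"]   # last name is a placeholder
homolog_pairs = [("GATA4", "Gata4"), ("TBX5", "Tbx5")]
tokens = build_token_dictionary(human_genes, mouse_genes, homolog_pairs)
```

Sharing a token between homologs lets the model see human and mouse cells through a partly common vocabulary, while species-specific genes keep their own entries.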
TranscriptFormer employs a multi-species training approach with balanced sampling across evolutionarily diverse organisms. The model up-weights low-resource species to balance against human and mouse data dominance [31]. It was trained using the AdamW optimizer with linear warm-up followed by cosine decay, with a global batch size of approximately 4-5 million tokens [31]. The training processed approximately 3.5 trillion tokens over up to 15 epochs using mixed-precision floating point (fp16/bf16) on H100 GPU clusters with Distributed Data Parallel (DDP) [31].
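The learning-rate schedule named above (linear warm-up followed by cosine decay) can be written as a small function. The sketch below is generic; the peak rate, warm-up length, and total step count in the example are placeholder values, not the hyperparameters used for TranscriptFormer.

```python
import math

def learning_rate(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Placeholder settings chosen only to make the shape of the schedule visible.
schedule = [learning_rate(s, peak_lr=1e-3, warmup_steps=10, total_steps=100)
            for s in range(101)]
```

The rate rises linearly over the first 10 steps, peaks, and then follows half a cosine wave down to `min_lr` at the final step.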
Icebear's training protocol involves a unique decompositional approach where the model learns to separate species factors from cell identity factors [6]. This enables the model to perform cross-species imputation by swapping species factors while preserving cell identity factors. The framework demonstrates particular utility for predicting single-cell profiles in missing cell types across species and facilitates direct comparison of expression profiles for conserved genes that have undergone chromosomal repositioning during evolution [6].
Table 2: Performance benchmarking across key biological tasks
| Model | Cross-Species Cell Type Classification (Macro F1) | Disease State Identification | Gene Regulatory Inference | Evolutionary Distance Generalization |
|---|---|---|---|---|
| GeneCompass | Superior to SOTA for single species [29] | Demonstrated for cell fate transition [29] | Validated via in silico gene deletion [29] | Captures homology across human and mouse [29] |
| TranscriptFormer | F1 > 0.7 for species separated by more than 600 million years [31] | F1 ~0.85-0.86 for COVID-19 infected vs. healthy [31] | Predicts cell type-specific TF interactions [9] | Covers 1.5B years of evolution across 12 species [9] |
| Icebear | Enables single-cell resolution comparison [6] | Predicts human Alzheimer's disease from mouse models [6] | Reveals evolutionary expression patterns [6] | Applied to eutherians, metatherians, and birds [6] |
Comprehensive benchmarking reveals the distinct strengths and specializations of each model. A recent systematic evaluation of six single-cell foundation models against established baselines using 12 metrics across diverse biological tasks provides insights into their relative performance characteristics [5]. The study found that no single scFM consistently outperforms others across all tasks, emphasizing the need for tailored model selection based on factors such as dataset size, task complexity, and computational resources [5].
TranscriptFormer demonstrates strong cross-species generalization, maintaining F1 scores above 0.7 for species separated by more than 600 million years of evolution, such as stony coral [31]. For human-specific tasks, it achieves macro F1 scores up to 0.91+ on the Tabula Sapiens v2 dataset and approximately 0.85-0.86 in distinguishing SARS-CoV-2 infected versus non-infected cells in human lung tissue [31].
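Because the scores above are macro F1 values, it helps to recall how the metric treats rare cell types. The sketch below computes macro F1 from scratch on invented labels; the function name `macro_f1` and the toy labels are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels: one of two neurons is misclassified as a T cell.
y_true = ["T cell", "T cell", "T cell", "T cell", "neuron", "neuron"]
y_pred = ["T cell", "T cell", "T cell", "T cell", "neuron", "T cell"]
score = macro_f1(y_true, y_pred)
```

Here accuracy is 5/6 (about 0.83) while macro F1 is 7/9 (about 0.78): every class counts equally, so an error on a rare cell type lowers the score more than it lowers accuracy.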
GeneCompass has been validated through in silico gene deletion studies for its ability to capture biologically meaningful regulatory relationships [29]. When GATA4 or TBX5 (genes with known roles in congenital heart disease) was deleted in silico in human fetal cardiomyocytes, the model showed a greater impact on the direct target genes of the deleted factor than on indirect targets, housekeeping genes, and other congenital heart disease-related genes, with statistically significant differences (p-value < 0.05 by t-test) [29]. This indicates that the pre-trained model learned genuine gene regulatory relationships.
Icebear has been applied to study evolutionary biology questions, particularly regarding X-chromosome upregulation (XCU) in mammals [6]. By predicting and comparing gene expression changes across eutherian mammals (mouse), metatherian mammals (opossum), and birds (chicken), Icebear revealed gene expression pattern shifts that support the existence of mammalian XCU and suggest the extent and molecular mechanisms of XCU vary among mammalian species and among X-linked genes with distinct evolutionary origins [6].
Figure 2: Experimental workflow for developing and validating cross-species foundation models
Table 3: Essential research reagents and computational resources for cross-species foundation model research
| Resource Category | Specific Examples | Function/Purpose | Implementation Considerations |
|---|---|---|---|
| Data Resources | CZ CELLxGENE [9] [31] | Curated single-cell data with standardized annotations | Provides >100M unique cells; essential for pre-training |
| | Tabula Sapiens [31] | Human scRNA-seq reference cell atlas | Used for evaluation with multiple donor IDs |
| | ZebraHub, GEO Accessions [31] | Species-specific single-cell data | Critical for cross-species generalization tests |
| Computational Infrastructure | H100/A100 GPU Clusters [31] | Model training and inference | Memory-intensive; A100 40GB recommended for inference |
| | DDP Training Framework [31] | Distributed training across multiple GPUs | Enables processing of trillion-token datasets |
| Gene Reference Databases | ESM-2 Protein Embeddings [31] | Protein sequence-based gene representations | Provides biological context beyond expression |
| | Homology Mapping Resources [29] | Orthologous gene identification across species | Critical for cross-species model integration |
| Evaluation Benchmarks | scGraph-OntoRWR [5] | Cell ontology-informed metric | Measures biological relevance of embeddings |
| | COVID-19 Lung Atlas [31] | Disease state classification benchmark | Tests infection response identification |
| | Multi-species Spermatogenesis Data [31] | Cross-species cell type annotation | Evaluates transfer learning across evolutionary distances |
The translational potential of cross-species foundation models extends significantly into drug development and biomedical research. These models enable more accurate translation of findings from model organisms to humans, which has historically been a major bottleneck in preclinical drug development [29] [6]. By identifying conserved cellular responses and pathways across species, researchers can prioritize drug targets with higher likelihood of translational success.
GeneCompass has demonstrated practical utility in identifying key factors associated with cell fate transitions: experimental validation showed that predicted candidate genes could induce the differentiation of human embryonic stem cells toward a gonadal fate [29] [30]. This capability opens new avenues for regenerative medicine and cellular therapy development by accelerating the discovery of critical cell fate regulators.
TranscriptFormer's ability to identify disease states without fine-tuning presents significant opportunities for drug discovery [9] [31]. The model surpassed baseline models at identifying SARS-CoV-2-infected cells from non-infected cells in the COVID Lung atlas, demonstrating utility for predicting cellular infection states in datasets where infection status is unknown or difficult to determine experimentally [31]. This capability can help identify novel mechanisms of pathogenesis and cellular defense responses that serve as potential therapeutic targets.
Icebear facilitates drug development by enabling prediction of human disease responses from mouse models [6]. The framework has been shown to accurately predict transcriptomic alterations in human Alzheimer's disease versus control samples based on mouse data, enabling transfer of knowledge from single-cell profiles in mouse disease models to human contexts [6]. This approach can significantly reduce the time and cost of preliminary drug validation studies.
Benchmarking studies indicate that foundation models serve as robust plug-and-play modules for various downstream tasks in biomedical research [5]. Their zero-shot embeddings capture biological insights into the relational structure of genes and cells, which provides a foundation for tasks ranging from cancer cell identification to drug sensitivity prediction across multiple cancer types and therapeutic compounds [5].
The field of cross-species foundation models is rapidly evolving, with several important directions emerging for future development. A critical challenge is improving model interpretability to build trust in predictions and facilitate biological discovery [10] [5]. While current models demonstrate impressive performance, understanding the biological reasoning behind their predictions remains challenging. Future iterations may incorporate more explicit biological knowledge representation and mechanism-based reasoning.
Another important direction is the integration of multimodal data beyond transcriptomics [10] [31]. While current models primarily focus on single-cell RNA sequencing data, incorporating information from epigenomics (scATAC-seq), proteomics, spatial transcriptomics, and imaging would provide a more comprehensive understanding of cellular states and functions. TranscriptFormer's developers have indicated plans to iterate and develop new models that combine multiple modalities [9].
For researchers selecting and implementing these models, benchmarking studies provide crucial guidance [5]. The performance of foundation models is highly task-dependent, with no single model consistently outperforming others across all applications. Researchers should consider factors including dataset size, task complexity, biological interpretability requirements, and available computational resources when selecting a model [5].
Practical implementation requires careful attention to technical specifications. TranscriptFormer, for instance, recommends A100 40GB GPUs for efficient inference, though it can operate on GPUs with as little as 16GB VRAM if the batch size is reduced [31]. The model variants are specialized for different use cases: TF-Metazoa for broad cross-species generalization, TF-Exemplar for human and major model organisms, and TF-Sapiens for human-only tasks [31].
As these models continue to evolve, they represent significant steps toward the ambitious goal of building comprehensive virtual cell models that can simulate cellular behavior across scales, time frames, and scientific modalities [9]. This capability would dramatically accelerate biomedical research by enabling computational experimentation and hypothesis testing prior to wet-lab validation, ultimately bringing scientists closer to curing, preventing, and managing human diseases.
This document details application notes and protocols for developing foundation models for cross-species cell annotation, focusing on self-supervised learning strategies applied to multi-species single-cell RNA-sequencing (scRNA-seq) corpora. The primary goal is to create models that learn fundamental biological principles conserved across evolution, enabling robust cell type identification and functional prediction across diverse species, including those not seen during training.
Table 1: Representative Models in Cross-Species Cell Annotation
| Model Name | Architecture | Training Corpus Scale | Number of Species | Key Demonstrated Capabilities |
|---|---|---|---|---|
| TranscriptFormer [9] | Transformer | 112 million cells [9] | 12 [9] | Cross-species cell type classification, disease state prediction, gene-gene interaction prompting |
| scTab [28] | Feature Attention Network | 22.2 million cells [28] | Human (cross-tissue) [28] | Scaling of performance with data and model size, cross-tissue annotation using data augmentation |
| Genomic Language Models (gLMs) [32] | Transformer-based | Varies (genome sequences) | Multiple [32] | Functional constraint prediction, sequence design, transfer learning |
Table 2: Key Performance Highlights from Featured Models
| Model / Experiment | Task | Performance Summary |
|---|---|---|
| TranscriptFormer [9] | Cell type classification in out-of-distribution species (e.g., rhesus macaque, marmoset) | Able to identify cell types in species not included in its pre-training data [9] |
| TranscriptFormer [9] | Identification of SARS-CoV-2-infected cells | Surpassed baseline models at identifying infected from non-infected cells without fine-tuning [9] |
| scTab [28] | Cross-tissue cell type classification in human | Demonstrated that non-linear models outperform linear counterparts when trained at large scale [28] |
Objective: To train a transformer model on a large, evolutionarily diverse corpus of single-cell transcriptomics data to learn a foundational representation of cell states.
Materials:
Procedure:
Masked Language Model (MLM) Pre-training:
Model Architecture and Training:
Objective: To leverage a pre-trained model to annotate cell types in a species that was not part of the training set, without any further fine-tuning.
Materials:
Procedure:
Reference-Based Annotation:
Validation:
Objective: To use the pre-trained generative model to simulate the transcriptomic outcome of a gene knockout or overexpression.
Materials:
Procedure:
Table 3: Essential Resources for Cross-Species Cell Atlas Research
| Resource / Solution | Type | Function in Research |
|---|---|---|
| CZ CELLxGENE [28] [9] | Data Platform | Provides a massive, curated collection of single-cell datasets, essential for sourcing diverse training data for foundation models. |
| Cell Ontology [28] | Computational Ontology | Provides a standardized, hierarchical vocabulary for cell types, crucial for harmonizing labels across different studies and species. |
| TranscriptFormer [9] | AI Model | A pre-trained, generative cross-species model that can be used directly for cell annotation, disease prediction, and in-silico experiments. |
| Ortholog Mapping Databases | Bioinformatics Resource | Provides the genetic mappings between species, enabling the alignment of gene features to create a unified input for models. |
| Urban Institute R Theme (urbnthemes) [33] | Software Tool | An R package that helps standardize and automate the creation of publication-quality data visualizations, ensuring clarity and consistency. |
In the burgeoning field of cross-species cell annotation foundation models, the process of tokenization—converting raw, analog gene expression data into discrete, structured model inputs—serves as the critical foundational step. This process determines how biological reality is perceived computationally, directly impacting a model's ability to learn meaningful representations and generalize across species boundaries. Single-cell RNA sequencing (scRNA-seq) data presents unique computational challenges: it is inherently high-dimensional (measuring 20,000+ genes), sparse (with many zero values representing technical dropouts rather than biological absence), and crucially, non-sequential—unlike natural language, genes lack a natural ordering [34] [4] [10]. Effective tokenization strategies must overcome these challenges to enable transformers and other deep learning architectures to decipher the "language of biology" encoded within cellular transcriptomes.
Several distinct methodological paradigms have emerged for tokenizing single-cell data, each with particular strengths for cross-species modeling. The table below systematically compares the four primary approaches.
Table 1: Comparative Analysis of Primary Tokenization Paradigms
| Tokenization Paradigm | Core Mechanism | Key Advantages | Inherent Limitations | Representative Models |
|---|---|---|---|---|
| Value Projection | Directly projects continuous expression values into an embedding space. | Preserves full resolution of expression data; avoids information loss from binning. | Requires handling continuous values; potentially more computationally intensive. | CellFM [35], scFoundation [35] |
| Value Categorization | Bins continuous expression values into discrete "buckets," treating input as categorical. | Simplifies the learning task; enables use of classification-focused architectures. | Loss of granular expression information; binning strategy introduces subjectivity. | scBERT [35] [34], scGPT [35] [34] |
| Gene Ranking | Orders genes by expression level within each cell, using rank rather than value. | Reduces technical noise; robust to batch effects and normalization artifacts. | Discards absolute expression magnitude; arbitrary sequence ordering. | Geneformer [35], scGPT (optional) [35] |
| Scale-Free Tokenization | Segments expression vector into fixed-size sub-vectors via 1D-convolution. | Eliminates need for manual gene selection; handles full gene length efficiently. | Novel approach with less extensive validation across diverse tasks. | scSFUT [34] [36] |
Value projection-based methods treat gene expression as a continuous signal, mapping scalar expression values into a high-dimensional embedding space through a learned linear or non-linear transformation. For instance, the CellFM model, an 800-million parameter foundation model, uses this approach to recover vector embeddings of masked genes from their linear projections, preserving the complete information content of the input data [35]. This strategy is particularly valuable for tasks requiring precise expression quantification, such as predicting subtle transcriptional responses to perturbations.
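Value projection can be illustrated with the simplest possible case, a linear map from a scalar to a vector. In the sketch below the weights are randomly initialised rather than learned, and `make_projection` and `project_value` are hypothetical names; real models learn these parameters and may use a non-linear transformation.

```python
import random

def make_projection(dim, seed=0):
    """Randomly initialised weight and bias vectors for a scalar-to-vector map."""
    rng = random.Random(seed)
    weight = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    bias = [0.0] * dim
    return weight, bias

def project_value(x, weight, bias):
    """Map one continuous expression value to an embedding: x * w + b."""
    return [x * w + b for w, b in zip(weight, bias)]

weight, bias = make_projection(dim=4)
embedding_low = project_value(0.5, weight, bias)
embedding_high = project_value(2.0, weight, bias)
```

Two different expression values always yield two different embeddings, so no resolution is lost at the input stage.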
In contrast, value categorization methods discretize the continuous spectrum of gene expression into a finite set of categories. The scBERT model, for example, employs a binning strategy that converts expression values into discrete tokens, effectively transforming the prediction problem into a classification task [35] [34]. While this approach simplifies the learning objective and can improve training stability, it inevitably sacrifices some resolution of the original expression data, potentially obscuring biologically meaningful variations in gene expression levels.
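The resolution trade-off of binning is easy to see in a small example. The sketch below uses equal-width bins over the non-zero values of one cell, which is only one of several possible schemes; `bin_expression` is a hypothetical helper, and the binning rules used by scBERT or scGPT may differ.

```python
def bin_expression(values, n_bins=5):
    """Map each non-zero value to an integer bin in 1..n_bins; zeros stay in bin 0."""
    nonzero = [v for v in values if v > 0]
    if not nonzero:
        return [0] * len(values)
    lo, hi = min(nonzero), max(nonzero)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    tokens = []
    for v in values:
        if v <= 0:
            tokens.append(0)
        else:
            tokens.append(min(n_bins, int((v - lo) / width) + 1))
    return tokens

tokens = bin_expression([0.0, 0.2, 1.0, 1.5, 3.0], n_bins=3)
```

In this toy cell the values 0.2 and 1.0 fall into the same bin although they differ fivefold, which is the kind of information loss the paragraph above refers to.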
Gene ranking strategies fundamentally reconceptualize the input representation by ignoring absolute expression values and instead focusing on relative expression relationships within each cell. Models like Geneformer and tGPT are trained on sequences of genes ordered by their expression levels, learning to predict gene ranks based on cellular context [35]. This method demonstrates particular robustness to technical variations between datasets and sequencing platforms, making it potentially valuable for cross-species applications where absolute expression levels may not be directly comparable.
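Rank-based tokenization can be sketched in a few lines. The example below orders the expressed genes of one cell by value and drops zeros; `rank_tokenize` is a hypothetical helper, the expression values are invented, and any per-gene rescaling that published rank-based models apply before ranking is omitted here.

```python
def rank_tokenize(expression, max_len=2048):
    """Return gene names ordered from highest to lowest expression, zeros dropped."""
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by descending value; break ties by gene name so the output is deterministic.
    expressed.sort(key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in expressed[:max_len]]

cell = {"GAPDH": 50.0, "CD3E": 12.0, "ALB": 0.0, "PTPRC": 20.0}
sequence = rank_tokenize(cell)

# Multiplying every value by a constant (e.g. deeper sequencing) leaves the order intact.
deeper = {gene: value * 10 for gene, value in cell.items()}
sequence_deeper = rank_tokenize(deeper)
```

The two sequences are identical, which illustrates why rank encodings are insensitive to global scaling differences between datasets, at the cost of discarding absolute magnitudes.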
The recently proposed scale-free tokenization approach, implemented in the scSFUT model, offers a significantly different paradigm. Instead of selecting highly variable genes or using ranking, scSFUT processes the entire gene expression vector by segmenting it into dimensionally reduced, information-dense sub-vectors using a fixed window size and 1D-convolution [34] [36]. This "tokenization-first" strategy allows the model to learn directly from high-dimensional data at its original scale without manual gene filtering, potentially capturing broader biological patterns that might be overlooked by gene-selection methods.
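The windowed segmentation can be approximated by a strided 1D convolution whose stride equals the window size. The sketch below uses a single fixed averaging kernel, so each window collapses to one number; this is an assumption for illustration, since a trained model would use learned kernels with several output channels to produce a sub-vector per window. `segment_tokens` is a hypothetical name.

```python
def segment_tokens(expression, window, kernel):
    """Split the vector into fixed windows and reduce each with a shared 1-D kernel."""
    if len(kernel) != window:
        raise ValueError("kernel length must equal the window size")
    # Zero-pad so the vector length is a multiple of the window.
    padded = list(expression) + [0.0] * (-len(expression) % window)
    tokens = []
    for start in range(0, len(padded), window):
        chunk = padded[start:start + window]
        tokens.append(sum(x * w for x, w in zip(chunk, kernel)))
    return tokens

expression = [1.0, 0.0, 2.0, 3.0, 0.0, 0.0, 4.0]   # toy 7-gene profile
tokens = segment_tokens(expression, window=3, kernel=[1 / 3, 1 / 3, 1 / 3])
```

Every gene contributes to exactly one token, so the full expression vector is consumed without a prior gene-selection step.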
Implementing effective tokenization requires meticulous data preprocessing to ensure consistency, especially for cross-species applications. The following protocol outlines the standardized workflow used by leading models:
Data Collection and Curation: Assemble diverse single-cell datasets from public repositories such as CELLxGENE, NCBI GEO, ENA, and species-specific atlases. For cross-species training, ensure representation from target organisms (e.g., human, mouse, zebrafish). The CellFM model, for instance, was trained on approximately 100 million human cells aggregated from 19,914 samples across different organs [35].
Quality Control and Filtering: Apply stringent quality control metrics tailored to each dataset and species. Standard parameters include:
Gene Name Standardization: Convert gene identifiers to standardized nomenclature according to authoritative databases (HGNC for human, MGI for mouse). This critical step enables cross-dataset and cross-species alignment by ensuring consistent gene identification [35].
Normalization: Apply library size normalization (e.g., to 10,000 reads per cell) followed by log-transformation to stabilize variance and make expression values more comparable across cells and datasets [34] [7].
Integration and Batch Correction: Employ computational methods (e.g., Harmony, scVI) to mitigate technical batch effects while preserving biological variation, which is particularly important when integrating data from multiple studies, platforms, and species [4] [7].
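The normalization step in the workflow above can be sketched without any dependencies. The function below scales one cell to a fixed total and applies log(1 + x); `normalize_log1p` is a hypothetical helper, and in Scanpy the equivalent operations are `sc.pp.normalize_total` followed by `sc.pp.log1p`.

```python
import math

def normalize_log1p(counts, target_sum=10_000):
    """Scale a cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c * target_sum / total) for c in counts]

cell_a = [10, 0, 30, 60]       # 100 total counts
cell_b = [100, 0, 300, 600]    # same composition, 10x deeper sequencing
norm_a = normalize_log1p(cell_a)
norm_b = normalize_log1p(cell_b)
```

Both cells end up with the same normalized profile, showing that differences in sequencing depth are removed while relative composition is kept.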
Table 2: Detailed Tokenization Methodologies by Model
| Model | Tokenization Details | Expression Value Handling | Positional Encoding | Special Tokens |
|---|---|---|---|---|
| scBERT | Gene-based tokens from pre-defined vocabulary | Binned into discrete expression levels | Standard transformer positional encoding | Cell-type tokens, separation tokens |
| scGPT | Gene identity + expression value embeddings | Either binned or normalized continuous values | Learnable positional embeddings | [PERT] for perturbations, [CLS] for cell embedding |
| Geneformer | Genes ranked by expression level; top 2,048 genes used | Relative ranking only (absolute values discarded) | Position based on rank order | Context tokens for tissue/disease state |
| CellFM | Value projection with linear embedding of expression | Continuous values preserved via projection | Modified RetNet positional encoding | LoRA adapters for efficient fine-tuning |
| scSFUT | Fixed-window segmentation of full expression vector | Raw values processed via 1D-convolution | Bias-free attention mechanism | Reconstruction tokens for self-supervision |
Value Categorization Protocol (scBERT):
Gene Ranking Protocol (Geneformer):
Scale-Free Tokenization Protocol (scSFUT):
Diagram Title: Tokenization Pathways for Cross-Species Foundation Models
Table 3: Essential Research Reagents and Computational Tools
| Category | Item/Resource | Specifications & Functions | Example Use Cases |
|---|---|---|---|
| Data Resources | CZ CELLxGENE | Provides >100 million standardized single cells across tissues and species; unified data structure | Pretraining data sourcing; cross-species reference atlas [4] [10] |
| | PanglaoDB | Curated compendium of single-cell transcriptomics data with marker gene annotations | Marker gene validation; cell type annotation priors [4] [10] |
| Software Frameworks | Scanpy | Python-based toolkit for single-cell analysis; standard preprocessing pipeline | Quality control; normalization; differential expression [34] [7] |
| | AnnDictionary | LLM-provider-agnostic Python package built on AnnData and LangChain | Automated cell type annotation; multi-LLM benchmarking [7] |
| | BioLLM | Unified framework for integrating diverse single-cell foundation models | Standardized model benchmarking; streamlined model switching [37] |
| Model Architectures | Transformer Variants | ERetNet (CellFM), Performer (scFoundation), standard Transformer (scBERT) | Balancing computational efficiency with model performance [35] [34] |
| | LoRA Mechanism | Low-Rank Adaptation for parameter-efficient fine-tuning | Adapting foundation models to new species with limited data [35] |
Diagram Title: Research Tool Ecosystem for Tokenization
Tokenization represents the crucial bridge between biological measurements and computational analysis in cross-species cell annotation foundation models. The choice of tokenization strategy—whether value projection, categorization, gene ranking, or scale-free approaches—fundamentally shapes what patterns a model can discover and how well it can generalize across biological contexts and species boundaries. As these models continue to evolve, we anticipate further innovation in tokenization methods that more effectively capture the hierarchical, dynamic nature of gene regulatory programs while remaining computationally tractable. The development of standardized frameworks like BioLLM for comparing these approaches will accelerate progress toward more accurate, interpretable, and biologically-grounded foundation models capable of unlocking the fundamental principles of cellular function across the tree of life.
Cross-species cell annotation foundation models represent a transformative advancement in single-cell biology. These models, pre-trained on tens of millions of cells from multiple species, learn fundamental biological principles that enable accurate cell type identification across evolutionary distances. The purpose of this application is to provide researchers with a robust, standardized methodology for annotating cell types in new, unlabeled data, including from species not present in the training corpus. This approach significantly reduces the reliance on manually curated labels and specialized, single-species reference atlases.
Table 1: Performance metrics of leading foundation models for cell type prediction.
| Model Name | Training Scale | Key Architectural Features | Reported Cross-Species Performance |
|---|---|---|---|
| TranscriptFormer [9] | 112 million cells, 12 species | Transformer, generative | State-of-the-art in classifying cell types in out-of-distribution species (e.g., rhesus macaque, marmoset) without fine-tuning. |
| CellFM [35] | 100 million human cells | 800M parameters, ERetNet layers, LoRA | Outperforms existing models in cell annotation tasks on diverse human cell datasets. |
| scBERT [4] | Millions of transcriptomes | Transformer, value categorization | Effective for human cell type annotation via fine-tuning on target datasets. |
Objective: To annotate cell types in a novel single-cell RNA-seq dataset from a target species using a pre-trained cross-species foundation model.
Materials and Reagents:
Procedure:
Model Inference and Embedding Generation:
Cell Type Annotation:
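As a minimal sketch of the annotation step, assuming cell embeddings have already been produced by a pre-trained model, labels can be transferred from annotated reference cells to query cells with a k-nearest-neighbour vote. The function names (`cosine_similarity`, `transfer_labels`) and the toy two-dimensional embeddings are hypothetical and are not part of any model's API.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def transfer_labels(ref_embeddings, ref_labels, query_embeddings, k=3):
    """Give each query cell the majority label of its k most similar reference cells.

    Returns (label, vote_fraction) pairs; the vote fraction is a crude confidence.
    """
    predictions = []
    for query in query_embeddings:
        scored = sorted(
            zip((cosine_similarity(query, ref) for ref in ref_embeddings), ref_labels),
            key=lambda pair: pair[0],
            reverse=True,
        )
        top_labels = [label for _, label in scored[:k]]
        label, votes = Counter(top_labels).most_common(1)[0]
        predictions.append((label, votes / len(top_labels)))
    return predictions


# Toy 2-D embeddings standing in for model output (hypothetical values).
reference = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
reference_labels = ["T cell", "T cell", "B cell", "B cell"]
query = [[0.95, 0.15], [0.05, 0.80]]

predicted = transfer_labels(reference, reference_labels, query, k=3)
for (label, confidence), cell in zip(predicted, query):
    print(cell, label, round(confidence, 2))
```

A low vote fraction can be used to flag cells for manual review rather than forcing a label.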
The progression of complex diseases like cancer is often nonlinear, characterized by a sudden deterioration from a pre-disease state to a disease state. Identifying this critical transition point is crucial for early intervention. This application note details a model-free method, Local Network Wasserstein Distance (LNWD), which uses single-sample analysis to detect these pre-disease states by measuring statistical perturbations in molecular networks [38].
Table 2: Application of LNWD in identifying critical states across complex diseases.
| Disease / Condition | Dataset Source | Key Finding | Validation Method |
|---|---|---|---|
| Renal Cancers (KIRP, KIRC) | TCGA | Successful identification of the critical pre-disease state before cancer progression. | Survival analysis and molecular network dynamics change analysis [38]. |
| Lung Adenocarcinoma (LUAD) | TCGA | Detection of critical transition signals from network perturbation. | Consistent with clinical staging and outcome data [38]. |
| Acute Lung Injury (Mice) | GEO: GSE2565 | Provided early warning signals for disease deterioration. | Matched with experimental time-course data [38]. |
| Rheumatoid Arthritis [39] | Leiden & TACERA Cohorts | Identified 4 distinct disease trajectories (A: high ESR; D: many inflamed joints). | Replicated in an independent cohort; linked to patient-reported outcomes. |
Objective: To compute LNWD scores for a series of samples to identify the critical pre-disease state during disease progression.
Materials and Reagents:
Procedure:
Local Network Construction:
Calculation of Local Network Wasserstein Distance (LNWD):
Identification of the Critical State:
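The following is a simplified illustration of the idea behind a network-based Wasserstein score, not the published LNWD algorithm [38]. It assumes a local network is a centre gene plus its PPI neighbours, and it measures how much adding one new sample shifts each gene's reference expression distribution. All names (`wasserstein_1d`, `local_score`, `network_score`) and the toy values are hypothetical.

```python
from bisect import bisect_right
from statistics import mean


def wasserstein_1d(u, v):
    """1-D Wasserstein-1 distance between two empirical samples.

    Computed as the area between the two empirical cumulative distributions.
    """
    u, v = sorted(u), sorted(v)
    points = sorted(set(u) | set(v))
    area = 0.0
    for left, right in zip(points, points[1:]):
        cdf_u = bisect_right(u, left) / len(u)
        cdf_v = bisect_right(v, left) / len(v)
        area += abs(cdf_u - cdf_v) * (right - left)
    return area


def local_score(center, neighbors, reference, sample):
    """Mean per-gene shift caused by adding one sample to the reference."""
    genes = [center] + list(neighbors)
    return mean(
        wasserstein_1d(reference[g], reference[g] + [sample[g]]) for g in genes
    )


def network_score(ppi, reference, sample):
    """Average the local scores over every local network in the PPI map."""
    return mean(local_score(c, nbrs, reference, sample) for c, nbrs in ppi.items())


# Toy inputs (hypothetical): PPI neighbours, reference expression, two test samples.
ppi = {"G1": ["G2", "G3"], "G2": ["G1"], "G3": ["G1"]}
reference = {
    "G1": [1.0, 1.2, 0.9, 1.1],
    "G2": [2.0, 2.1, 1.9, 2.0],
    "G3": [0.5, 0.6, 0.4, 0.5],
}
stable_sample = {"G1": 1.05, "G2": 2.0, "G3": 0.5}
perturbed_sample = {"G1": 3.0, "G2": 0.2, "G3": 2.5}

stable = network_score(ppi, reference, stable_sample)
perturbed = network_score(ppi, reference, perturbed_sample)
print(f"stable={stable:.3f} perturbed={perturbed:.3f}")
```

In a time-course setting, a sharp rise of such a score at one time point would be the candidate warning signal to inspect.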
Understanding the functional relationships between genes is fundamental to biology. Gene-gene interaction mapping identifies pairs of genes that, when mutated together, result in an unexpected phenotype (e.g., synthetic lethality), revealing functional redundancy, compensation, and pathway relationships. This application covers both experimental and AI-driven computational methods for large-scale gene interaction mapping.
Table 3: Approaches for large-scale genetic interaction mapping.
| Method / Technology | Scale / Model | Key Outcome | Application Context |
|---|---|---|---|
| Dual Tn-seq [40] | Surveyed ~1 million gene pairs in S. pneumoniae; over 1 billion double mutants created. | Identified 200+ previously unknown genetic interactions; discovered a new enzyme family. | High-throughput screening in bacteria for functional genomics and antibiotic target discovery. |
| CRISPR-based qGI Profiling [41] | ~4 million gene pairs tested in human HAP1 cells; ~90,000 genetic interactions mapped. | Generated a hierarchical model of human cell function; recapitulated and expanded on DepMap data. | Unbiased functional genomics in a human cell line model to understand genetic architecture. |
| AI Model Prediction (TranscriptFormer) [9] | Generative model trained on 112M cells. | Predicts gene-gene co-expression relationships in specific cell types and conditions via prompting. | In-silico hypothesis generation for gene function and interaction before wet-lab validation. |
Objective: To systematically identify genetic interactions (e.g., synthetic lethal pairs) in a bacterial pathogen on a genome-wide scale.
Materials and Reagents:
Procedure:
Competitive Growth Assay:
Sequencing and Data Analysis:
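To make the notion of an "unexpected phenotype" concrete, the sketch below scores a gene pair by comparing the observed double-mutant fitness with the fitness expected if the two mutations acted independently. The multiplicative null model and the threshold are assumptions chosen for illustration; the source does not state which scoring model the cited screens use, and the function names are hypothetical.

```python
def interaction_score(fitness_a, fitness_b, fitness_ab):
    """Observed double-mutant fitness minus the multiplicative expectation."""
    expected = fitness_a * fitness_b
    return fitness_ab - expected


def classify_interaction(score, threshold=0.1):
    """Label a score as negative (e.g. synthetic sick/lethal), positive, or neutral."""
    if score <= -threshold:
        return "negative"
    if score >= threshold:
        return "positive"
    return "neutral"


# Hypothetical relative fitness values (wild type = 1.0).
score = interaction_score(fitness_a=0.9, fitness_b=0.8, fitness_ab=0.1)
print(round(score, 2), classify_interaction(score))
```

A strongly negative score, as in this toy case, is the pattern expected for a synthetic lethal pair.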
Table 4: Essential tools for advanced single-cell and gene interaction analysis.
| Item / Resource | Type | Primary Function |
|---|---|---|
| CZ CELLxGENE [9] [4] | Data Platform | Provides unified access to millions of curated single-cell datasets for model training and analysis. |
| TranscriptFormer [9] | AI Foundation Model | A generative model for cross-species cell type prediction, disease state identification, and gene interaction prediction via prompting. |
| Barcoded Transposon Library (e.g., for Dual Tn-seq) [40] | Wet-lab Reagent | Enables high-throughput generation and tracking of mutants in pooled screens. |
| CRISPR gRNA Library (e.g., TKOv3) [41] | Wet-lab Reagent | Allows for systematic knockout of genes in human cells for essentiality and genetic interaction screens. |
| Local Network Wasserstein Distance (LNWD) [38] | Computational Algorithm | Detects critical transition states in complex diseases from single-sample transcriptomic data. |
| PPI Network (e.g., from STRING-db) [38] | Bioinformatics Resource | Provides prior knowledge of protein interactions for constructing local networks in algorithms like LNWD. |
Cross-species integration of single-cell RNA-sequencing (scRNA-seq) data enables researchers to explore evolutionary relationships and identify conserved and divergent cell types across species. The fundamental challenge lies in distinguishing true biological variation from technical artifacts and species-specific effects, often termed "species effect," where cells from the same species exhibit higher transcriptional similarity to each other than to their cross-species counterparts [42]. Robust benchmarking is therefore essential to ensure integration results reflect biology rather than computational artifacts. This protocol outlines comprehensive metrics and methodologies for evaluating integration performance, with a focus on balancing species-mixing with biological conservation.
Performance evaluation in cross-species integration spans two primary objectives: achieving adequate mixing of homologous cell types from different species, and preserving meaningful biological heterogeneity within and across cell types. The table below summarizes the key metrics employed for these purposes, their mathematical basis, and ideal values.
Table 1: Comprehensive Metrics for Benchmarking Cross-Species Integration Performance
| Metric Category | Metric Name | Description | Interpretation & Ideal Value |
|---|---|---|---|
| Species Mixing | Average Silhouette Width (ASW) batch | Measures how closely cells from the same species cluster together versus how separated they are from other species [42]. | Value Range: -1 to 1. Ideal: Closer to 0, indicating no batch (species) structure. |
| | Normalized Mutual Information (NMI) | Quantifies the similarity between the species label distribution and the clustering result [42]. | Value Range: 0 to 1. Ideal: Lower values indicate better mixing (less association). |
| | Alignment Score (for SAMap) | Quantifies the percentage of cross-species neighbors for each cell [42]. | Value Range: 0 to 1. Ideal: Higher values indicate better alignment of homologous types. |
| Biology Conservation | Accuracy Loss of Cell type Self-projection (ALCS) | A novel metric quantifying the loss of cell type distinguishability post-integration using a self-projection concept [42]. | Value Range: 0 to 1. Ideal: Lower values indicate better preservation of biological heterogeneity. |
| | Average Silhouette Width (ASW) cell type | Measures how closely cells of the same cell type cluster together after integration [42]. | Value Range: -1 to 1. Ideal: Higher values indicate cell types are well-separated. |
| | Normalized Mutual Information (NMI) cell type | Quantifies the similarity between the cell type label distribution and the clustering result [42]. | Value Range: 0 to 1. Ideal: Higher values indicate better preservation of cell type identity. |
| | Cell-type Label Transfer Accuracy (ARI) | Assesses annotation transfer using Adjusted Rand Index between original and transferred labels [42]. | Value Range: -1 to 1. Ideal: Values closer to 1 indicate highly accurate cross-species annotation. |
| Overall Score | Integrated Score | A weighted average of the scaled species-mixing and biology conservation scores [42]. | Ideal Weighting: 40% species-mixing, 60% biology conservation [42]. |
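The 40/60 weighting in the Overall Score row can be sketched as follows. This assumes each strategy already has one species-mixing score and one biology-conservation score, both oriented so that higher is better, and that "scaled" means min-max scaling across the compared strategies. The scaling choice and the function names are assumptions for illustration.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] across strategies; zeros if there is no spread."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def integrated_scores(mixing, conservation, w_mixing=0.4, w_conservation=0.6):
    """Weighted average of scaled species-mixing and biology-conservation scores."""
    scaled_mixing = min_max_scale(mixing)
    scaled_conservation = min_max_scale(conservation)
    return [
        w_mixing * m + w_conservation * c
        for m, c in zip(scaled_mixing, scaled_conservation)
    ]


# Hypothetical per-strategy scores for three integration strategies.
strategies = ["strategy_1", "strategy_2", "strategy_3"]
mixing = [0.2, 0.8, 0.5]
conservation = [0.9, 0.3, 0.6]

scores = integrated_scores(mixing, conservation)
for name, score in zip(strategies, scores):
    print(name, round(score, 3))
```

Because conservation carries the larger weight, a strategy that mixes species aggressively but blurs cell types ranks below one that preserves cell-type structure.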
Integrated Score = (0.4 × Species Mixing Score) + (0.6 × Biology Conservation Score) [42].

The following diagram illustrates the logical flow and key decision points in the cross-species integration benchmarking pipeline.
Table 2: Key Computational Tools and Resources for Cross-Species Integration
| Tool/Resource Name | Type | Primary Function in Context |
|---|---|---|
| BENGAL Pipeline [42] | Computational Pipeline | A freely available benchmark and assessment pipeline for cross-species integration, facilitating the evaluation of multiple strategies. |
| scANVI [42] [43] | Integration Algorithm | A semi-supervised deep learning model that leverages cell-type annotations to achieve a balance between species-mixing and biology conservation. |
| scVI [42] [43] | Integration Algorithm | A probabilistic deep learning framework that effectively models technical and biological noise for data integration. |
| Seurat V4 [42] | Integration Algorithm | Uses CCA or RPCA to identify anchors between datasets, demonstrating strong performance in cross-species tasks. |
| SAMap [42] | Integration Algorithm | Specialized for whole-body atlas integration between species with challenging gene homology, using reciprocal BLAST for gene-gene mapping. |
| ENSEMBL Compara [42] | Bioinformatics Database | Provides pre-computed gene homology mappings (e.g., one-to-one, one-to-many orthologs) essential for creating a shared feature space. |
| CZ CELLxGENE [10] [9] | Data Platform | Provides unified access to millions of curated single-cell datasets across multiple species, serving as a key data source for pretraining and testing. |
| TranscriptFormer [9] | Foundation Model | A generative, multi-species model trained on 112M cells from 12 species, enabling cross-species cell type prediction and analysis without fine-tuning. |
The identification of homologous genes across species is a cornerstone of comparative genomics, crucial for inferring gene function, understanding evolutionary history, and annotating cells. Traditional methods that rely on one-to-one ortholog mapping are increasingly inadequate for capturing the complex reality of genomic evolution, which is replete with gene duplications, paralogs, and co-orthologs. This application note details advanced computational protocols and metrics designed to address these complexities. We frame these methodologies within the emerging paradigm of cross-species cell annotation foundation models, which leverage vast datasets and self-supervised learning to create unified representations of biological data. The provided protocols for alignment-free sequence comparison, iterative graph matching, and homology-independent integration, alongside benchmarking strategies, equip researchers with the tools to achieve more accurate and biologically meaningful cross-species comparisons.
The premise of one-to-one ortholog mapping, where a single gene in one species corresponds to a single gene in another, often fails to reflect biological complexity. Evolutionary events like gene duplication and whole-genome duplication lead to the proliferation of co-orthologs—paralogous genes within a genome that are collectively orthologous to one or more genes in another species [44] [45]. Relying solely on one-to-one mappings risks misannotating these genes and obscuring true evolutionary relationships.
This challenge is magnified in the context of cross-species cell annotation foundation models (scFMs). These models, such as TranscriptFormer and scGPT, are trained on millions of single-cell transcriptomes from multiple species to learn a unified representation of cellular biology [10] [9]. Their goal is to enable tasks like cell type annotation, disease state prediction, and gene-gene interaction analysis across vast evolutionary distances. The performance of these models is fundamentally dependent on the quality and biological accuracy of the gene homology mappings used to align the feature spaces of different species. Annotation heterogeneity—the use of different gene annotation methods across the species in a study—can artificially inflate the number of lineage-specific genes by up to 15-fold, presenting a major source of artifact in comparative genomics [46]. Moving beyond simple one-to-one mapping is therefore not merely an academic exercise but a practical necessity for robust biological discovery.
Principle: Alignment-free methods operate on the hypothesis that similar sequences share a significant number of k-mers (contiguous substrings of length k). These methods circumvent the computational intensity of traditional alignment algorithms, enabling rapid all-against-all comparison of large sequence sets, which is often the first bottleneck in ortholog assignment pipelines [44].
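The k-mer hypothesis can be illustrated with a small sketch that counts shared k-mers between every pair of sequences through an inverted index. This is not the afree implementation: a plain dictionary stands in for its encoded and clustered k-mer list, and the sequence names, toy peptides, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmers(sequence, k):
    """Return the set of distinct contiguous substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_counts(sequences, k=3):
    """Count distinct k-mers shared by each pair of sequences.

    An inverted index (k-mer -> sequence names) means pairs are only
    touched when they actually share a k-mer, avoiding explicit alignment.
    """
    index = defaultdict(set)
    for name, sequence in sequences.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(name)

    counts = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Toy protein fragments (hypothetical).
sequences = {
    "geneA_sp1": "MKTAYIAKQR",
    "geneA_sp2": "MKTAYIAKQL",
    "geneB_sp1": "GGSLVPRGSH",
}
pair_counts = shared_kmer_counts(sequences, k=3)
print(pair_counts)
```

Pairs that never appear in the result share no k-mer and can be skipped by any downstream similarity scoring.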
Protocol: The afree Algorithm Workflow
The following workflow visualizes the core process of the afree algorithm for efficient large-scale sequence comparison:
Detailed Methodology:
L as a tuple containing:
L into a 64-bit machine word. The encoding scheme is as follows:
L based on the encoded k-mer value. This clustering allows the algorithm to efficiently identify all sequence pairs that share common k-mers in a single pass, a key factor in its scalability [44].

Principle: After initial similarity detection, an iterative graph matching strategy can be employed to resolve complex many-to-many orthologous relationships. This method operates on a bipartite graph where nodes represent genes from two genomes, and weighted edges represent their sequence similarity scores. The goal is to find a matching that maximizes the sum of similarity scores while allowing for the identification of co-orthologs [44].
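The idea of repeated matching rounds on a weighted bipartite graph can be sketched as below. Note the substitution: each round uses a greedy heuristic (highest-weight edges first) in place of an exact maximum-weight matching, and later rounds accept an extra pair only if its score is close to the best score already found for either gene. The acceptance ratio, the function names, and the toy similarity scores are all assumptions, not the cited method's parameters.

```python
def greedy_round(edges, used_pairs):
    """One matching round: each gene is used at most once, best edges first."""
    matched_a, matched_b, picked = set(), set(), []
    for (a, b), weight in sorted(edges.items(), key=lambda kv: kv[1], reverse=True):
        if (a, b) in used_pairs or a in matched_a or b in matched_b:
            continue
        picked.append((a, b, weight))
        matched_a.add(a)
        matched_b.add(b)
    return picked


def iterative_matching(edges, rounds=2, min_ratio=0.8):
    """Round 1 yields primary orthologs; later rounds add candidate co-orthologs."""
    best = {}          # best accepted score seen for each gene
    used = set()       # pairs already considered in an earlier round
    accepted = []
    for round_index in range(rounds):
        for a, b, weight in greedy_round(edges, used):
            used.add((a, b))
            reference = max(best.get(a, 0.0), best.get(b, 0.0))
            if round_index == 0 or weight >= min_ratio * reference:
                accepted.append((a, b))
                best[a] = max(best.get(a, 0.0), weight)
                best[b] = max(best.get(b, 0.0), weight)
    return accepted


# Toy similarity scores: b1 and b1p are duplicates in genome B (hypothetical).
edges = {
    ("a1", "b1"): 95.0,
    ("a1", "b1p"): 90.0,
    ("a2", "b2"): 80.0,
    ("a1", "b2"): 20.0,
    ("a2", "b1"): 15.0,
}
orthologs = iterative_matching(edges)
print(orthologs)
```

Here `a1` ends up with two partners (`b1` and `b1p`), the many-to-one pattern that a strict one-to-one mapping would discard, while the weak `a2`/`b1` edge is rejected.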
Protocol: Iterative Graph Matching for Co-orthology
afree algorithm) [44].

Principle: The Mestortho algorithm is based on the phylogenetic minimum evolution (ME) criterion. It postulates that a set of sequences consisting purely of orthologs will have a smaller sum of branch lengths (the Minimum Evolution Score, or MES) in a neighbor-joining tree than a set that includes one or more paralogous relationships [45].
Protocol: Orthology Detection with Mestortho
Integrating data across species requires mapping genes via homology, and the chosen strategy significantly impacts the results. A comprehensive benchmark of 28 strategies, each combining a gene homology mapping method with an integration algorithm, provides critical guidance [47].
Table 1: Benchmarking Metrics for Cross-Species Integration Strategies
| Metric Category | Metric Name | Description | What It Quantifies |
|---|---|---|---|
| Species Mixing | Average Silhouette Width (ASW) Batch | Measures how close cells are to cells of the same species versus others in the embedding. | Better mixing of homologous cell types across species. |
| | Graph Integration Local Inverse Simpson's Index (GILISI) | Assesses the local diversity of species labels among a cell's nearest neighbors. | Whether cell neighborhoods contain a mix of species. |
| Biology Conservation | Average Silhouette Width (ASW) Cell Type | Measures how close cells are to cells of the same type versus other types. | Preservation of distinct cell type clusters. |
| | Normalized Mutual Information (NMI) | Quantifies the similarity between the cell type clustering before and after integration. | Conservation of the original biological grouping. |
| | ALCS (New Metric) | Accuracy Loss of Cell type Self-projection; quantifies the blending of distinct cell types post-integration. | Protection against overcorrection, which obscures species-specific cell types [47]. |
Table 2: Key Findings from Benchmarking 28 Integration Strategies [47]
| Finding | Implication for Experimental Design |
|---|---|
| The choice of integration algorithm (e.g., scANVI, scVI, SeuratV4) has a greater impact on performance than the homology mapping method. | Prioritize selection of a robust integration algorithm. |
| For evolutionarily distant species, including in-paralogs (one-to-many orthologs) in the gene mapping is beneficial. | Use a more inclusive homology mapping strategy for distant species comparisons. |
| SAMap, which uses de-novo BLAST instead of pre-defined orthology tables, outperforms other methods for whole-body atlas integration between species with challenging gene homology annotation. | Use SAMap for complex integrations, especially when standard orthology tables are incomplete or unreliable [47]. |
| The new ALCS metric is critical for identifying overcorrection, where algorithms force integration so strongly that they blend biologically distinct cell types. | Always use ALCS alongside other metrics to ensure biological heterogeneity is preserved [47]. |
Table 3: Essential Software and Database Tools for Complex Homology Analysis
| Tool Name | Type | Primary Function | Application Context |
|---|---|---|---|
| afree & EGM2 [44] | Algorithm/Pipeline | Alignment-free all-against-all sequence comparison & iterative ortholog detection. | Fast homology search foundation for large datasets. |
| Mestortho [45] | Python Program | Detects orthologs from a multiple sequence alignment using the minimum evolution criterion. | Phylogeny-based orthology inference for curated gene families. |
| BENGAL Pipeline [47] | Benchmarking Pipeline | Systematically tests and evaluates cross-species integration strategies. | Selecting the optimal data integration method for a given project. |
| SAMap [47] | Integration Algorithm | Uses iterative BLAST and cell-cell mapping to integrate data, ideal for challenging homology. | Aligning whole-body atlases across evolutionarily distant species. |
| SegmentNT [48] | DNA Foundation Model | Fine-tunes pretrained models to annotate genic and regulatory elements at single-nucleotide resolution. | Genome annotation without relying on pre-defined gene models. |
| TranscriptFormer [9] | Single-Cell Foundation Model (scFM) | A generative model trained on 112M cells from 12 species for cross-species prediction. | Cell type annotation, disease state prediction, and gene-gene interaction analysis across species. |
| CZ CELLxGENE [10] [9] | Data Platform | Provides unified access to millions of annotated single-cell datasets. | Source of curated training data for scFMs and cross-species analysis. |
Navigating the complexity of gene homology requires a multifaceted toolkit that moves decisively beyond the simplicity of one-to-one ortholog mapping. The protocols and strategies detailed herein, from scalable alignment-free sequence comparison and phylogeny-based orthology assignment to rigorously benchmarked integration methods, provide a roadmap for robust cross-species genomic and single-cell analysis. As the field advances towards powerful foundation models capable of synthesizing biological information across millions of cells and billions of years of evolution, the accurate resolution of gene homology remains a critical prerequisite. By adopting these advanced methodologies, researchers can mitigate artifacts, uncover true evolutionary relationships, and fully leverage the potential of cross-species foundation models to illuminate cellular function and disease.
Mitigating Batch Effects and Sequencing Depth Inconsistencies
In cross-species cell annotation research, single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics data are plagued by technical variations known as batch effects. These non-biological differences, arising from factors like different sequencing platforms, laboratory conditions, or sample preparation protocols, can obscure genuine biological signals and compromise the integrity of foundation models. Similarly, inconsistencies in sequencing depth (the number of reads per cell) can create artificial variation in gene detection, misrepresenting true cellular states. For foundation models aiming to create a unified representation of cells across diverse species and tissues, the robust mitigation of these technical confounders is not merely a preprocessing step but a foundational prerequisite. This document outlines standardized protocols and application notes for identifying and correcting these issues, ensuring reliable and reproducible cell annotation.
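As a small illustration of the sequencing-depth problem, the sketch below applies library-size normalization followed by a log transform, a common preprocessing choice that is assumed here rather than prescribed by the protocols in this document. The function name, the target sum of 10,000, and the toy counts are hypothetical.

```python
import math


def normalize_depth(counts, target_sum=10_000):
    """Scale a cell's counts to a fixed total, then apply log(1 + x)."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {
        gene: math.log1p(count / total * target_sum)
        for gene, count in counts.items()
    }


# Two cells with the same composition but a 10-fold difference in depth.
shallow_cell = {"CD3E": 2, "MS4A1": 0, "ACTB": 8}
deep_cell = {"CD3E": 20, "MS4A1": 0, "ACTB": 80}

shallow_norm = normalize_depth(shallow_cell)
deep_norm = normalize_depth(deep_cell)
print(shallow_norm)
print(deep_norm)
```

After normalization the two cells have matching profiles, so the depth difference no longer masquerades as a difference in cell state. Genes lost to dropout in shallow cells are not recovered by this step.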
The performance of various batch-effect correction and analysis tools can be quantitatively assessed based on their underlying algorithms, data handling capabilities, and performance metrics. The table below summarizes these aspects for several methods discussed in recent literature.
Table 1: Comparison of Batch Effect Mitigation and Analysis Frameworks
| Method / Framework Name | Underlying Architecture / Algorithm | Data Types Handled | Key Metrics and Performance | Primary Application in Annotation | Key Advantages |
|---|---|---|---|---|---|
| Harmony [49] [50] | Iterative clustering and linear mixture modeling | scRNA-seq, CITE-seq | Effectively integrates datasets from 38 tissues and 700 individuals; improves cross-dataset gene expression program (GEP) reproducibility [49]. | Data integration for unified cell state definition. | Corrects gene-level data while preserving non-negative values for component-based models [49]. |
| T-CellAnnoTator (TCAT) / starCAT [49] | Consensus Non-negative Matrix Factorization (cNMF) with batch correction | scRNA-seq, CITE-seq | Identified 46 reproducible GEPs; accurately infers GEP usage in query datasets (Pearson R > 0.7) [49]. | Quantifying predefined GEP activities in new cells/datasets. | Provides a consistent cell state representation across datasets; robust for rare GEPs and fast for query data [49]. |
| Nicheformer [3] | Transformer-based Foundation Model | Dissociated scRNA-seq, Spatial Transcriptomics (MERFISH, Xenium, etc.) | Outperforms models trained only on dissociated data (e.g., Geneformer, scGPT) in spatial tasks like niche and composition prediction [3]. | Learning spatially aware cellular representations and transferring spatial context to dissociated data. | Jointly trained on dissociated and spatial data; uses rank-based gene encoding for robustness to technical biases [3]. |
| TranscriptFormer [9] | Transformer-based Foundation Model | Cross-species scRNA-seq | Achieves state-of-the-art performance in cross-species cell type classification and disease state prediction, even for "out-of-distribution" species [9]. | Generalizing biological patterns across vast evolutionary distances. | Trained on 112 million cells from 12 species; enables prediction without species-specific labeled data [9]. |
This protocol details the use of the starCAT framework to annotate cell states in a new, cross-species query dataset using a pre-defined catalog of Gene Expression Programs (GEPs).
1. Principle

The starCAT pipeline avoids de novo analysis of query data. Instead, it leverages a fixed, multi-dataset catalog of GEPs, learned from a large, batch-corrected reference, to quantify the activity of these conserved programs in new query datasets. This ensures consistent annotation and enables the detection of rare cell states that might be missed in smaller query datasets [49].

2. Reagents and Materials

Table 2: Essential Research Reagent Solutions
| Item | Function / Description |
|---|---|
| Reference GEP Catalog | A pre-computed set of consensus GEPs (e.g., the 46 T cell cGEPs) derived from multiple batch-corrected datasets. Serves as the fixed coordinate system for annotation [49]. |
| Processed Query Dataset | A quality-controlled (QC'd) gene expression matrix (cells x genes) from a new experiment, which may be from a different species or technology. |
| Batch-Corrected Reference Data | Large, integrated scRNA-seq dataset(s) (e.g., the 1.7 million T cell reference) used to derive the robust GEP catalog. Corrected with tools like Harmony [49]. |
| CITE-seq Antibody-Derived Tags (Optional) | Surface protein expression data from CITE-seq, integrated into the GEP spectra to enhance biological interpretability and annotation confidence [49]. |
3. Procedure

1. Reference GEP Catalog Construction:
   a. Data Collection & Harmonization: Collate multiple large-scale scRNA-seq datasets encompassing the cell types of interest across desired species and conditions.
   b. Batch Effect Correction: Apply a batch correction method like Harmony to the raw, non-negative count data to generate a harmonized gene-level expression matrix [49].
   c. Consensus NMF (cNMF): Run cNMF on the batch-corrected data to learn robust GEP spectra (gene weights) and their usage in each reference cell. This involves multiple runs of NMF followed by consensus clustering [49].
   d. Curation & Annotation: Manually curate the resulting GEPs, removing technical artifacts and annotating them based on top-weighted genes and gene-set enrichment analysis. This creates the final reference catalog.
This protocol describes using the Nicheformer foundation model to enrich dissociated scRNA-seq data with spatial context and mitigate technology-specific biases.
1. Principle

Nicheformer is a transformer model pretrained on a massive, curated corpus (SpatialCorpus-110M) containing both dissociated and spatially resolved single-cell data. It learns a joint representation that captures spatial context, allowing it to perform spatially aware tasks and transfer spatial information from targeted spatial transcriptomics to dissociated scRNA-seq data, which typically has broader gene coverage [3].

2. Procedure

1. Model Input Preparation (Tokenization):
   a. For a given cell (from either dissociated or spatial data), convert its gene expression vector into a sequence of gene tokens.
   b. Rank-based Encoding: Order the gene tokens by their expression level relative to the technology-specific non-zero mean, not by absolute value. This strategy enhances robustness to technology-driven batch effects [3].
   c. Contextual Tokens: Prepend the sequence with special tokens indicating the species (e.g., human, mouse), data modality (dissociated vs. spatial), and specific technology (e.g., MERFISH, 10X) [3].
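The tokenization steps above can be sketched as follows. This is an illustration of rank-based encoding with contextual tokens, not Nicheformer's actual code: the function name, the `<...>` token format, the tie-breaking rule, and the toy expression values are assumptions.

```python
def tokenize_cell(expression, tech_nonzero_mean, species, modality, technology,
                  max_genes=None):
    """Build a token sequence: context tokens first, then genes ranked by
    expression relative to the technology-specific non-zero mean."""
    relative = {
        gene: value / tech_nonzero_mean[gene]
        for gene, value in expression.items()
        if value > 0 and tech_nonzero_mean.get(gene, 0) > 0
    }
    # Highest relative expression first; ties broken alphabetically.
    ranked = sorted(relative, key=lambda gene: (-relative[gene], gene))
    if max_genes is not None:
        ranked = ranked[:max_genes]
    context = [f"<{species}>", f"<{modality}>", f"<{technology}>"]
    return context + ranked


# Toy cell (hypothetical values). ACTB has the highest raw count but is below
# its technology-wide mean, so it ranks last after rescaling.
expression = {"ACTB": 50, "CD3E": 4, "GFAP": 0, "SST": 3}
tech_nonzero_mean = {"ACTB": 100.0, "CD3E": 2.0, "GFAP": 5.0, "SST": 1.0}

tokens = tokenize_cell(expression, tech_nonzero_mean,
                       species="human", modality="dissociated", technology="10X")
print(tokens)
```

Ranking on relative rather than absolute expression keeps ubiquitously high genes from dominating the front of every sequence, which is the stated motivation for robustness to technology-driven effects.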
Cross-species analysis of single-cell RNA-sequencing (scRNA-seq) data presents a powerful approach for understanding evolutionary biology and cellular function. A significant challenge in this field is the robust integration of data across different species to identify homologous cell types accurately. This Application Note details the implementation and findings of a comprehensive benchmarking study evaluating 28 distinct integration strategies for cross-species single-cell data. The content is framed within the broader research context of developing cross-species cell annotation foundation models, which require high-quality, integrated data for training and validation to accurately decipher the 'language' of cells across different organisms [10].
The BENchmarking strateGies for cross-species integrAtion of singLe-cell RNA sequencing data (BENGAL) pipeline was developed to systematically assess cross-species integration strategies [47]. The pipeline evaluates strategies based on their ability to mix cells from known homologous types across species (species-mixing) while preserving biological heterogeneity present within each species (biology conservation). Prior to analysis, user-performed quality control and curation of cell ontology annotations are essential.
The benchmarking was conducted across 16 diverse biological tasks to ensure broad applicability [47]. These tasks were designed to evaluate performance in different scenarios, as summarized in Table 1.
Table 1: Summary of Benchmarking Biological Tasks
| Task Category | Biological Context | Species Involved | Key Evaluation Focus |
|---|---|---|---|
| Adult Tissue Analysis | Pancreas, Hippocampus, Heart | Multiple vertebrate species | Integration of homologous cell types in specialized tissues |
| Whole-Body Embryonic Development | Embryonic atlases | Species with challenging gene homology | Handling of complex, whole-organism data |
| Multi-Species Integration | Heart data | 5 species | Upper limit of species numbers in a single integration |
| Pairwise Divergence Analysis | Various tissues | 10 pairwise tasks | Impact of evolutionary divergence time on integration |
The benchmarking evaluated 28 strategies, formed by combining gene homology mapping methods with integration algorithms and including the standalone method SAMap [47]. Performance was quantitatively assessed using established metrics for species mixing and biology conservation, combined into an integrated score. Table 2 summarizes the performance of the top-performing strategies.
Table 2: Top-Performing Integration Strategies and Key Findings
| Integration Strategy | Overall Performance | Strengths and Optimal Use Cases |
|---|---|---|
| scANVI | High integrated score | Achieves optimal balance between species-mixing and biology conservation [47]. |
| scVI | High integrated score | Robust performance across multiple tissue types and species pairs [47]. |
| SeuratV4 (CCA/RPCA) | High integrated score | Effective for standard cross-species comparisons with well-annotated genomes [47]. |
| SAMap | Specialized outperformer | Superior for whole-body atlas integration between species with challenging gene homology annotation [47]. |
| Strategies with In-Paralogs | Beneficial for distant species | Including in-paralogs in the gene mapping step improves evolutionarily distant species integration [47]. |
Integration outputs were assessed from three primary perspectives [47]:
A key metric developed for this benchmark was the Accuracy Loss of Cell type Self-projection (ALCS), which specifically quantifies the unwanted blending of distinct cell types within a species after integration, indicating overcorrection [47].
Objective: To map orthologous genes between species and create a concatenated raw count matrix for integration.
Reagents & Materials:
Procedure:
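A minimal sketch of this objective, assuming a one-to-one ortholog table is already available (for example, exported from a homology database) and that counts are held as nested dictionaries. The function name, species labels, and toy data are hypothetical; a real pipeline would typically use sparse matrices and an annotated data container.

```python
def concatenate_by_orthologs(counts_a, counts_b, orthologs,
                             species_a="human", species_b="mouse"):
    """Build one count matrix over a shared feature space.

    counts_a / counts_b: {cell_id: {gene: raw_count}} for each species.
    orthologs: {gene_in_species_a: gene_in_species_b}, one-to-one pairs only.
    Returns the shared feature list and {(species, cell_id): count_vector}.
    """
    features = sorted(orthologs)  # features are named after species A genes
    matrix = {}
    for cell, genes in counts_a.items():
        matrix[(species_a, cell)] = [genes.get(g, 0) for g in features]
    for cell, genes in counts_b.items():
        matrix[(species_b, cell)] = [genes.get(orthologs[g], 0) for g in features]
    return features, matrix


# Toy inputs (hypothetical). Genes without an ortholog pair are dropped.
one_to_one = {"CD3E": "Cd3e", "GCG": "Gcg"}
human_counts = {"h_cell1": {"CD3E": 5, "GCG": 0, "XIST": 3}}
mouse_counts = {"m_cell1": {"Cd3e": 2, "Gcg": 7, "Gm1": 1}}

features, matrix = concatenate_by_orthologs(human_counts, mouse_counts, one_to_one)
print(features)
print(matrix)
```

Keeping the species label in each row key preserves the batch covariate that integration algorithms need downstream.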
Objective: To apply integration algorithms to the concatenated data matrix to generate a joint embedding.
Reagents & Materials:
Procedure:
Objective: To quantitatively and qualitatively evaluate the quality of the integrated embedding.
Reagents & Materials:
Procedure:
Table 3: Essential Research Reagents and Computational Tools
| Item Name | Function / Purpose | Application Context |
|---|---|---|
| ENSEMBL Comparative Tool | Provides gene homology mapping (orthology predictions) between species [47]. | Essential pre-processing step for identifying comparable genes across species before data integration. |
| BENGAL Pipeline | A freely available cross-species integration and assessment pipeline [47]. | Core framework for running and evaluating the 28 integration strategies in a standardized manner. |
| SCCAF (Single Cell Clustering Assessment Framework) | Machine learning-based framework for self-projection and cluster assessment [47]. | Used to implement the ALCS metric and for annotation transfer tests. |
| Reciprocal BLAST | Tool for de-novo gene-gene homology analysis [47]. | Critical component of the SAMap workflow, especially for species with poor existing gene annotations. |
| scANVI / scVI Algorithms | Probabilistic deep learning models for single-cell data integration [47]. | Top-performing algorithms for general-purpose cross-species integration tasks. |
| SeuratV4 with CCA/RPCA | Integration methods using canonical correlation analysis or reciprocal PCA [47]. | Robust, well-established methods for cross-species integration, particularly effective with clear one-to-one homologous cell types. |
Within the framework of cross-species cell annotation foundation model research, biological validation is a critical step for assessing model performance and translational utility. This application note details experimental protocols and validation strategies for two complex biological systems: brain cell types and spermatogenesis. The case study on spermatogenesis demonstrates how foundation models like TranscriptFormer enable the transfer of cell-type annotations across evolutionarily distant species, a capability with profound implications for evolutionary biology and translational research [9]. Advanced single-cell RNA sequencing (scRNA-seq) technologies now provide the resolution necessary to deconstruct the dynamic process of spermatogenesis across mammalian species, revealing both conserved and divergent molecular programs [51] [52].
Spermatogenesis is a highly conserved yet rapidly evolving process, making it an ideal system for validating cross-species foundation models. The process involves precisely orchestrated transitions from spermatogonia (mitotic stem cells) through spermatocytes (undergoing meiosis) to spermatids (post-meiotic haploid cells) [51] [52]. Recent evolutionary analyses of single-nucleus transcriptome data from testes of 11 species covering all main mammalian lineages (eutherians, marsupials, and monotremes) and birds have revealed that the rapid evolution of the testis is driven by accelerated evolutionary rates in late spermatogenic stages [51]. This evolutionary context provides a robust framework for testing the ability of foundation models to identify homologous cell types despite significant molecular divergence.
The TranscriptFormer model is a generative, multi-species model for single-cell transcriptomics trained on 112 million cells from 12 species, covering 1.5 billion years of evolution [9]. For spermatogenesis research, it can identify cell types in species not included in its training data (such as rhesus macaque and marmoset) and transfer labels accurately across related species [9]. This enables researchers to annotate spermatogenic cell types in poorly characterized species using models trained on well-annotated reference datasets, accelerating the mapping of spermatogenesis across mammals.
Table 1: Key Quantitative Findings from Cross-Species Spermatogenesis Studies
| Finding | Measurement | Biological Significance |
|---|---|---|
| Evolutionary Rate Variation | Rate of expression evolution substantially higher in postmeiotic haploid cell types (rSD and eSD) compared to diploid spermatogenic cells [51]. | Explains rapid evolution of the testis; suggests reduced pleiotropic constraints and haploid selection in late spermatogenesis [51]. |
| Primate Analysis Resolution | Evolutionary rates progressively increase from late meiosis (pachytene SC) until the end of spermiogenesis (late eSD) [51]. | Provides cellular source for rapid testis evolution and enables fine-grained analysis of primate spermatogenesis [51]. |
| TranscriptFormer Performance | Can identify cell types in species not seen during training (rhesus macaque, marmoset) [9]. | Enables translation of biological insights across species and annotation of cell types in unmapped species [9]. |
| Single-Cell Atlas Scale | 97,521 high-quality nuclei from 11 species with median of 1,856 RNA molecules per cell [51]. | Provides comprehensive resource for investigating testis biology across mammals [51]. |
The following diagram illustrates the integrated computational and experimental workflow for cross-species validation of spermatogenic cell types using foundation models:
Purpose: To generate high-quality single-nucleus transcriptome data from testicular tissues across multiple mammalian species for foundation model training and validation [51].
Materials:
Procedure:
Library Preparation:
Sequencing:
Quality Control:
Purpose: To integrate snRNA-seq data across species and quantify evolutionary changes in spermatogenic gene expression programs [51].
Materials:
Procedure:
Cross-Species Integration:
Cell Type Annotation:
Evolutionary Rate Calculation:
Quality Control:
Purpose: To adapt foundation models for cross-species annotation of spermatogenic cell types [9].
Materials:
Procedure:
Model Fine-Tuning:
Cross-Species Prediction:
Quality Control:
The molecular regulation of spermatogenesis involves complex signaling pathways that are conserved across mammals. The following diagram illustrates key pathways and their components identified through cross-species analysis:
Table 2: Essential Research Reagents for Cross-Species Spermatogenesis Studies
| Reagent/Category | Specific Examples | Function/Application |
|---|---|---|
| Single-Cell Platforms | 10x Genomics Chromium, BD Rhapsody | High-throughput single-cell RNA sequencing of testicular cell populations [51] [53] |
| Nuclear Isolation Kits | 10x Nucleus Isolation Kit, Active Motif | Isolation of intact nuclei from frozen testicular tissues for snRNA-seq [51] |
| Cell Type Markers | UTF1 (spermatogonia), SYCP3 (spermatocytes), PRM1 (spermatids) | Identification and validation of spermatogenic cell types across species [51] |
| Foundation Models | TranscriptFormer, Nicheformer | Cross-species cell type annotation and prediction of spatial context [9] [3] |
| Spatial Transcriptomics | 10x Visium, MERFISH, Xenium | Validation of cellular niches and germ cell-soma interactions [3] |
| Bioinformatics Tools | Scanpy, Seurat, scVI | Data integration, clustering, and evolutionary analysis [51] |
This application note provides detailed protocols for the biological validation of foundation models using spermatogenesis as a case study. The integrated approach combining single-nucleus transcriptomics, evolutionary analysis, and foundation model fine-tuning enables robust cross-species cell type annotation. The conserved yet rapidly evolving nature of spermatogenesis makes it an ideal system for testing the limits of foundation model generalization across species. These protocols establish a framework for validating foundation models in other complex biological systems, with particular relevance for translational research in male infertility and contraceptive development.
The accurate identification of cell types across species is a cornerstone of comparative biology, with profound implications for understanding evolution, developmental biology, and disease mechanisms. The emergence of single-cell foundation models (scFMs) represents a transformative approach for deciphering the "language" of cells by applying large-scale deep learning to vast single-cell transcriptomic datasets [10]. These models, pretrained on millions of cells, learn fundamental biological principles that can potentially generalize across taxonomic boundaries.
However, cross-species prediction faces significant biological and computational challenges. Studies consistently demonstrate that marker gene transferability decreases as evolutionary distance increases [54]. Research on primate embryoid bodies revealed that human marker genes were less effective in macaques and vice versa, highlighting fundamental limitations in direct annotation transfer [54]. Similarly, analysis of primate brain tissues identified that while 76% of genes showed conserved expression patterns, the remaining 24% exhibited extensive differences between human and non-human primates [55].
This Application Note examines the current capabilities and limitations of scFMs in cross-species cell annotation, with specific focus on prediction accuracy from primates to zebrafish. We provide a structured framework for evaluating model performance, detailed protocols for implementation, and practical solutions for overcoming biological divergence in translational studies.
The core challenge in cross-species prediction lies in the fundamental biological differences that accumulate over evolutionary time, manifesting at multiple molecular levels.
Genetic and Regulatory Differences: Analysis of five primate species (human, chimp, gorilla, macaque, and marmoset) revealed that 3,383 out of 14,131 genes (24%) showed extensive expression differences in homologous cell types [55]. These divergent genes were particularly associated with synaptic assembly and function, with nearly half showing expression divergence limited to glial cell types.
Marker Gene Transferability Limitations: A systematic study of embryoid bodies from four primate species demonstrated that the discriminatory power of marker genes decreases with phylogenetic distance [54]. Human marker genes proved less effective for annotating macaque cells, indicating that even between closely related species, direct annotation transfer has limitations that must be accounted for in prediction models.
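The loss of discriminatory power described above can be made concrete by scoring a marker gene separately in each species. The sketch below uses the area under the ROC curve (AUROC) of the marker's expression for one cell type against all other cells; this choice of score, the function name, and the toy expression values are assumptions for illustration and are not necessarily the measure used in [54].

```python
def auroc(scores, is_target):
    """Probability that a random target cell outranks a random non-target cell.

    Ties contribute 0.5, which matches the Mann-Whitney formulation of AUROC.
    """
    pos = [s for s, t in zip(scores, is_target) if t]
    neg = [s for s, t in zip(scores, is_target) if not t]
    if not pos or not neg:
        raise ValueError("need at least one target and one non-target cell")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical expression of one human-derived marker in six cells per species.
is_target = [True, True, True, False, False, False]
human_expr = [5, 6, 7, 0, 1, 0]      # marker cleanly separates the cell type
macaque_expr = [5, 0, 7, 0, 6, 1]    # same marker is noisier in the other species

print(auroc(human_expr, is_target))    # perfect separation
print(auroc(macaque_expr, is_target))  # weaker separation
```

Comparing the two scores per marker, and repeating this across species pairs, gives a simple table of how marker quality changes with phylogenetic distance.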
While scFMs show remarkable potential, several technical constraints currently limit their cross-species applicability.
Data Quality and Integration Challenges: Single-cell data exhibits characteristics of high sparsity, high dimensionality, and low signal-to-noise ratio [56]. When integrating data across species, additional complications arise from batch effects, technical noise, and varying processing steps [10]. These issues are compounded when comparing evolutionarily distant species like primates and zebrafish.
Architectural Constraints: Most scFMs use transformer architectures that require sequential input, but gene expression data lacks natural ordering [10] [56]. Current solutions include ranking genes by expression levels or partitioning them into expression bins, but these arbitrary sequences may not capture biologically meaningful relationships conserved across species.
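The two ordering strategies mentioned above can be sketched in a few lines. This is a simplified illustration, not the tokenizer of any particular model: the function names are hypothetical, and the equal-width binning used here stands in for the per-cell binning schemes that published models define in their own ways.

```python
def rank_tokens(expression, top_k=None):
    """Order gene names by decreasing expression, dropping unexpressed genes."""
    expressed = [(g, v) for g, v in expression.items() if v > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))  # gene name breaks ties deterministically
    genes = [g for g, _ in expressed]
    return genes[:top_k] if top_k else genes

def bin_tokens(expression, n_bins=3):
    """Map each gene to an equal-width expression bin in 1..n_bins (0 = unexpressed)."""
    max_val = max(expression.values())
    bins = {}
    for gene, val in expression.items():
        if val <= 0 or max_val <= 0:
            bins[gene] = 0
        else:
            bins[gene] = min(n_bins, int(val / max_val * n_bins) + 1)
    return bins

# One toy cell (hypothetical counts).
cell = {"PRM1": 9, "SYCP3": 4, "UTF1": 0, "ACTB": 1}
print(rank_tokens(cell))   # ['PRM1', 'SYCP3', 'ACTB']
print(bin_tokens(cell))    # {'PRM1': 3, 'SYCP3': 2, 'UTF1': 0, 'ACTB': 1}
```

Both encodings discard information: ranking keeps only relative order within a cell, and binning keeps only coarse magnitude. Neither encodes which genes are orthologous across species, which must be supplied separately.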
Table 1: Key Challenges in Cross-Species Cell Annotation
| Challenge Category | Specific Limitations | Impact on Prediction Accuracy |
|---|---|---|
| Biological Divergence | Decreasing marker gene transferability with evolutionary distance | Reduced annotation accuracy for distant species |
| Biological Divergence | Differences in gene co-expression networks (139 genes with human-specific connectivity identified) | Limited functional inference across species |
| Biological Divergence | Variation in cell type-specific gene expression (3,383 genes with primate differences) | Incorrect cell type matching |
| Technical Limitations | Data sparsity and high dimensionality | Increased noise in model predictions |
| Technical Limitations | Non-sequential nature of omics data | Suboptimal representation learning |
| Technical Limitations | Batch effects across experiments and species | Confounded biological signals |
Evaluating scFM performance requires multiple metrics to capture different dimensions of prediction quality. Based on benchmarking studies, we recommend the following assessment framework.
Table 2: Performance Metrics for Cross-Species scFM Evaluation
| Metric Category | Specific Metrics | Interpretation in Cross-Species Context |
|---|---|---|
| Cell Type Annotation Accuracy | Overall accuracy, F1-score, Precision/Recall | Measures direct prediction correctness against ground truth |
| Cell Type Annotation Accuracy | Lowest Common Ancestor Distance (LCAD) | Ontological proximity of misclassified cell types [56] |
| Cell Type Annotation Accuracy | scGraph-OntoRWR | Consistency with known biological relationships [56] |
| Dataset Integration Quality | Batch integration scores | Separation of biological vs. technical variation |
| Dataset Integration Quality | Label transfer accuracy | Effectiveness of annotation between species |
| Biological Relevance | Gene ontology term prediction | Functional knowledge capture in embeddings |
| Biological Relevance | Tissue specificity prediction | Conservation of spatial expression patterns |
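Two of the annotation metrics in the table can be illustrated with a short sketch: macro-averaged F1 and an ontology-aware distance in the spirit of LCAD. The toy ontology, the function names, and the edge-counting definition are assumptions; the real Cell Ontology is a graph with multiple parents per term, and the definition in [56] may differ in detail.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1, so rare cell types count as much as common ones."""
    scores = []
    for lab in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def path_to_root(node, parent):
    """List of terms from node up to the ontology root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lca_distance(a, b, parent):
    """Edges from a and from b up to their lowest common ancestor (tree ontology)."""
    depth_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_b, node in enumerate(path_to_root(b, parent)):
        if node in depth_from_a:
            return depth_from_a[node] + steps_b
    raise ValueError("terms share no ancestor")

# Hypothetical tree-shaped ontology: child -> parent.
parent = {
    "spermatogonium": "germ cell", "spermatocyte": "germ cell",
    "germ cell": "cell", "Sertoli cell": "somatic cell", "somatic cell": "cell",
}
print(lca_distance("spermatogonium", "spermatocyte", parent))   # 2: a near miss
print(lca_distance("spermatogonium", "Sertoli cell", parent))   # 4: a severe error
```

Averaging this distance over misclassified cells separates models that confuse sibling cell types from models that confuse unrelated lineages, a distinction that plain accuracy hides.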
Comprehensive benchmarking of six prominent scFMs against traditional methods reveals a nuanced performance landscape. Under the realistic conditions of cross-species prediction, no single scFM consistently outperforms others across all tasks [56]. The relative performance depends heavily on specific factors including dataset size, biological complexity, and evolutionary distance between source and target species.
For evolutionarily close species (primate-to-primate annotation), scFMs demonstrate robust performance with accuracy metrics often exceeding 0.85 for well-conserved cell types [55]. However, performance degradation occurs with increasing phylogenetic distance, particularly for neuronal and immune cell types that exhibit accelerated evolutionary divergence [55].
Notably, simpler machine learning models sometimes outperform complex foundation models in specific cross-species tasks, particularly when training data is limited or computational resources are constrained [56]. This suggests that scFMs provide the greatest value when leveraging their pretrained knowledge base through transfer learning, rather than applying them in zero-shot scenarios across large evolutionary distances.
Accurate cross-species prediction requires careful identification of orthologous cell types as a foundation for model training and validation.
Procedure:
Key Considerations: This protocol requires careful handling of uneven cell type compositions between species. The interactive Shiny application (https://shiny.bio.lmu.de/CrossSpeciesCellType/) provides a practical implementation framework [54].
Effective adaptation of pretrained scFMs for cross-species prediction requires systematic fine-tuning and validation.
Procedure:
Technical Notes: Limit fine-tuning to 10-20% of original pretraining epochs to prevent catastrophic forgetting. Use batch sizes that maintain representation of both species in each update.
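The technical notes above can be sketched as two small helpers: one that converts the 10-20% guideline into an epoch budget, and one that builds batches with equal numbers of cells from every species. Both function names and the undersampling strategy (surplus cells of the larger species are dropped) are assumptions for illustration, not part of any specific model's training code.

```python
import random

def fine_tune_epoch_budget(pretrain_epochs, fraction=0.1):
    """Epoch cap for fine-tuning as a fraction of the pretraining epochs (at least 1)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return max(1, round(pretrain_epochs * fraction))

def species_balanced_batches(cells_by_species, batch_size, seed=0):
    """Yield batches of (species, cell) pairs with equal representation per species."""
    rng = random.Random(seed)
    species = sorted(cells_by_species)
    per_species = batch_size // len(species)
    if per_species == 0:
        raise ValueError("batch_size is smaller than the number of species")
    # Shuffle each species independently so batches are not ordered by cell id.
    pools = {s: rng.sample(cells_by_species[s], len(cells_by_species[s])) for s in species}
    n_batches = min(len(pools[s]) for s in species) // per_species
    for i in range(n_batches):
        batch = []
        for s in species:
            chunk = pools[s][i * per_species:(i + 1) * per_species]
            batch.extend((s, cell) for cell in chunk)
        yield batch

cells = {"human": list(range(6)), "macaque": list(range(4))}
print(fine_tune_epoch_budget(50, 0.1))
for batch in species_balanced_batches(cells, batch_size=4):
    print(batch)
```

Undersampling is the simplest way to keep both species in every update; oversampling the smaller species or weighting the loss are alternatives when the data imbalance is large.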
Implementing cross-species prediction requires specific computational tools and resources. The following table summarizes essential solutions for scFM-based cross-species annotation.
Table 3: Essential Research Reagents for Cross-Species scFM Implementation
| Reagent Category | Specific Tools/Databases | Function in Cross-Species Prediction |
|---|---|---|
| Pretrained Models | Geneformer, scGPT, UCE, scFoundation [56] | Provide base models for fine-tuning with biological knowledge |
| Data Resources | CZ CELLxGENE [10], Human Cell Atlas [10], PanglaoDB [10] | Curated single-cell data for training and validation |
| Orthology Databases | OrthoDB [55], Ensembl Compare | Map genes across species for model alignment |
| Evaluation Tools | MetaNeighbor [55], SingleR [54], scGraph-OntoRWR [56] | Validate cross-species cell type correspondence |
| Annotation Databases | Cell Ontology, Synaptic Gene Ontology (SynGO) [55] | Provide standardized vocabulary for cell types |
Cross-species prediction from primates to zebrafish represents both a formidable challenge and tremendous opportunity for advancing comparative biology and translational research. Single-cell foundation models provide a powerful framework for addressing this challenge, but their effective application requires careful consideration of biological divergence, appropriate model selection, and systematic validation.
The protocols and metrics presented here establish a foundation for rigorous cross-species prediction that acknowledges both the capabilities and limitations of current approaches. As scFM architectures evolve and incorporate multi-omic data, their ability to capture conserved biological principles across larger evolutionary distances will continue to improve, potentially bridging the gap between primate and zebrafish biology with increasing accuracy.
Future directions should focus on incorporating protein structure information [57], developing explicit models of evolutionary distance, and creating standardized cross-species benchmarking datasets. These advances will accelerate the deployment of scFMs in both basic research and drug development, where cross-species extrapolation remains a critical challenge.
Cross-species cell annotation represents a transformative approach in single-cell biology, enabling the deciphering of cellular function, development, and disease across evolutionary timescales. By leveraging foundation models trained on vast, evolutionarily diverse datasets, researchers can now identify conserved and divergent cellular states, offering unprecedented insights into fundamental biological processes and accelerating therapeutic development. This paradigm shift moves beyond single-species analysis to a unified framework for understanding cellular biology from a comparative perspective. These foundation models, trained on hundreds of millions of cells, serve as powerful virtual instruments, allowing scientists to ask complex biological questions and test in-silico hypotheses before conducting wet-lab experiments [9]. This document provides detailed application notes and protocols for employing these models to reveal novel biological insights into evolutionary conservation and divergence, framed within the broader thesis of cross-species cell annotation research.
The evolutionary conservation of core developmental programs is exemplified by microglia, the resident immune cells of the central nervous system. Across vertebrate species, microglia share a conserved origin from primitive yolk sac-derived macrophages (or analogous structures like the rostral blood island in zebrafish) that colonize the embryonic brain early in development [58]. This ontogenetic pathway is a conserved hallmark, independent of definitive bone marrow hematopoiesis.
Despite this shared origin, microglia exhibit significant functional and phenotypic divergence across species. Their morphology, gene expression profiles, and responses to stimuli vary considerably, reflecting evolutionary adaptations shaped by factors such as lifespan, regenerative capacity, and overall immune system architecture [58]. For example, in contrast to mammals, yolk sac-derived microglia in birds are transient and are largely replaced by definitive hematopoietic cells later in development [58]. This interplay between conserved origins and species-specific functions makes microglia a prime model for studying evolutionary conservation and divergence using cross-species annotation tools.
Modern foundation models, pre-trained on massive single-cell transcriptomics datasets encompassing multiple species, have demonstrated remarkable capabilities for cross-species biological discovery. The following table summarizes key insights and performance metrics from state-of-the-art models.
Table 1: Performance and Insights from Cross-Species Foundation Models
| Model Name | Training Scale | Key Demonstrated Capability | Performance Highlight | Biological Insight |
|---|---|---|---|---|
| TranscriptFormer [9] | 112 million cells from 12 species (~1.5 billion years of evolution) | Predict cell types in out-of-distribution species (e.g., rhesus macaque, marmoset) | Surpassed baseline models at identifying SARS-CoV-2-infected cells without fine-tuning [9] | Enables translation of gene expression patterns and biological mechanisms across vast evolutionary distances. |
| CellFM [35] | 100 million human cells | Gene function prediction, perturbation prediction, and cell annotation. | Outperforms existing models in gene function prediction and gene-gene relationship capturing [35] | Provides a unified model to represent cellular states, overcoming data noise and sparsity. |
| LICT [59] | Evaluated on diverse datasets (PBMCs, embryos, gastric cancer) | Objective reliability assessment for cell type annotation using multi-LLM integration. | Achieved a 48.5% full match rate with manual annotations on low-heterogeneity embryo data [59] | Addresses annotation reliability, a critical challenge in single-cell biology, especially for ambiguous cell clusters. |
These models function as a "virtual instrument" for researchers. A key application is the prompting of generative models like TranscriptFormer to simulate gene-gene interactions within specific cell types and organisms, thereby identifying co-expressed genes and predicting underlying regulatory networks [9]. Furthermore, their ability to produce contextualized gene embeddings that are cell-specific offers a more granular understanding of gene function compared to static annotations [9].
Rigorous quality control is a prerequisite for reliable cross-species annotation. This is especially critical for multi-modal data like CITE-Seq, which simultaneously measures gene expression and surface protein abundance. The CITESeQC software package provides a systematic, quantitative framework for this purpose [60]. It employs metrics like Shannon entropy to quantify the cell type-specificity of gene or protein expression across clusters, with lower entropy values indicating higher specificity. It also assesses the correlation between gene expression and the abundance of their corresponding proteins, an expected biological relationship that serves as an internal quality check [60]. This objective assessment is vital for ensuring that downstream analyses and model predictions are built upon high-quality, reliable data.
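The entropy-based specificity measure described above can be illustrated as follows. This is a minimal sketch of Shannon entropy over a feature's mean expression per cluster; the function name and the toy values are hypothetical, and CITESeQC may normalize or scale the quantity differently [60].

```python
import math

def expression_entropy(mean_expr_per_cluster):
    """Shannon entropy (bits) of a gene's or protein's expression across clusters.

    Low values mean expression is concentrated in few clusters (high specificity).
    """
    total = sum(mean_expr_per_cluster)
    if total <= 0:
        raise ValueError("feature is not expressed in any cluster")
    probs = [v / total for v in mean_expr_per_cluster if v > 0]
    return sum(p * math.log2(1 / p) for p in probs)

# Hypothetical mean expression of two features across four clusters.
print(expression_entropy([10, 0, 0, 0]))  # 0.0 bits: expressed in one cluster only
print(expression_entropy([5, 5, 5, 5]))   # 2.0 bits: uniform, not cluster-specific
```

The maximum possible value is log2 of the number of clusters, so entropies from datasets with different cluster counts are only comparable after dividing by that maximum.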
This protocol details the use of foundation models for annotating cell types across different species, which is fundamental for identifying evolutionarily conserved and divergent cell states.
Table 2: Research Reagent Solutions for Cross-Species Annotation
| Item Name | Function/Explanation |
|---|---|
| TranscriptFormer Model [9] | A generative, multi-species foundation model for single-cell transcriptomics that serves as the primary annotation engine. |
| CZ CELLxGENE / ZebraHub / Tabula Sapiens Data [9] | Curated, publicly available single-cell atlases that provide the foundational data for model training and validation. |
| CITESeQC R Package [60] | A software package for performing multi-layered, quantitative quality control on CITE-Seq (RNA + protein) data prior to analysis. |
| LICT (LLM-based Identifier for Cell Types) [59] | A tool that leverages multiple large language models to provide objective, reference-free assessment of cell annotation reliability. |
Procedure:
Cross-Species Annotation Workflow
This protocol uses foundation models to predict the transcriptomic consequences of genetic perturbations, enabling the study of gene regulatory networks across species and conditions.
Procedure:
In-Silico Perturbation Analysis
This protocol leverages multiplexed immunofluorescence to generate high-quality training labels for a deep learning model that can classify cell types directly from standard H&E-stained histopathology images, enabling scalable spatial biomarker discovery.
Procedure:
H&E Cell Classification via mIF
A fundamental challenge in computational biology is developing models that perform reliably on data from species not encountered during training, known as out-of-distribution (OOD) species. This capability is crucial for creating truly generalizable biological foundation models that can accelerate discovery across the tree of life. Recent research has revealed a significant generalization gap where models excel on familiar species but fail to maintain predictive performance when applied to evolutionarily distant organisms [62]. This application note examines the current state of OOD generalization in cross-species cell annotation models, provides experimental protocols for evaluation, and offers visualization tools to guide research in this emerging field.
Robust evaluation requires multiple metrics to assess different aspects of model generalization. The table below summarizes the primary quantitative measures used in recent studies.
Table 1: Key Performance Metrics for OOD Species Generalization
| Metric | Definition | Interpretation | Typical Performance Range |
|---|---|---|---|
| Cell Type Annotation Accuracy | Percentage of cells correctly classified in unseen species | Measures basic transfer learning capability | 40-85% depending on evolutionary distance [9] |
| Neural Predictivity Score | Correlation between model predictions and actual neuronal responses to OOD stimuli | Quantifies how well models generalize to novel visual patterns | Varies significantly across model architectures [62] |
| Disease State Prediction AUC | Area Under Curve for identifying infected/diseased cells in new species | Evaluates clinical or pathological relevance | >0.75 in state-of-the-art models [9] |
| Gene-Gene Interaction Accuracy | Precision in predicting conserved genetic interactions | Tests understanding of fundamental biological mechanisms | Higher for evolutionarily conserved pathways [9] |
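The cell type annotation accuracy metric in the table is most informative when reported per species and split by whether the species was seen in training. The sketch below shows one way to tabulate this and to summarize the generalization gap; the record format, function names, and toy predictions are assumptions for illustration.

```python
from collections import defaultdict

def accuracy_by_species(records, train_species):
    """Per-species accuracy, tagged as in-distribution or OOD.

    records: iterable of (species, true_label, predicted_label) tuples.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for species, truth, pred in records:
        totals[species] += 1
        hits[species] += int(truth == pred)
    return {
        s: ("in-distribution" if s in train_species else "OOD", hits[s] / totals[s])
        for s in totals
    }

def generalization_gap(report):
    """Mean in-distribution accuracy minus mean OOD accuracy."""
    seen = [acc for split, acc in report.values() if split == "in-distribution"]
    unseen = [acc for split, acc in report.values() if split == "OOD"]
    if not seen or not unseen:
        raise ValueError("need at least one seen and one unseen species")
    return sum(seen) / len(seen) - sum(unseen) / len(unseen)

# Hypothetical predictions; marmoset was held out of training.
records = [
    ("human", "T cell", "T cell"), ("human", "B cell", "B cell"),
    ("mouse", "T cell", "T cell"), ("mouse", "B cell", "B cell"),
    ("marmoset", "T cell", "T cell"), ("marmoset", "T cell", "T cell"),
    ("marmoset", "B cell", "T cell"), ("marmoset", "B cell", "B cell"),
]
report = accuracy_by_species(records, train_species={"human", "mouse"})
print(report)
print(generalization_gap(report))
```

Averaging per species rather than per cell keeps a large, well-sampled species from masking poor performance on a small held-out one.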
Recent benchmarking studies reveal substantial differences in how various architectures handle OOD generalization. TranscriptFormer has demonstrated state-of-the-art performance, accurately identifying cell types in unseen species like rhesus macaque and marmoset without species-specific training data [9]. In comparative analyses, adversarially robust models often yield substantially higher generalization in neural predictivity, though the degree of robustness itself does not directly predict performance [62]. Surprisingly, performance on common computer vision OOD benchmarks does not correlate with OOD neural predictivity performance, suggesting that domain-specific evaluation is essential [62].
Table 2: Model Architecture Comparison for OOD Generalization
| Model Type | OOD Cell Type Accuracy | Training Data Scale | Strengths | Limitations |
|---|---|---|---|---|
| Transformer-based (TranscriptFormer) | 70-85% for closely related species [9] | 112 million cells across 12 species [9] | Cross-species transfer, generative capabilities | Computational intensity for training [10] |
| Adversarially Robust Models | Improved but unquantified neural predictivity [62] | Varies by implementation | Resistance to synthetic OOD stimuli | Limited single-cell implementation |
| Encoder-Based (scBERT) | Moderate for in-domain species [10] | Millions of single-cell transcriptomes [10] | Effective for classification tasks | Limited generative capacity |
| Decoder-Based (scGPT) | Good interpolation, limited extrapolation [10] | Diverse single-cell corpora [10] | Strong generative performance | May learn ecologically implausible relationships [63] |
Purpose: To evaluate model performance on cell type identification in evolutionarily distant species not included in training data.
Materials:
Procedure:
Validation Methods:
Purpose: To measure how well model representations predict neural responses to novel, out-of-distribution visual stimuli.
Materials:
Procedure:
Key Considerations:
Figure 1: Single-Cell Foundation Model Architecture
Figure 2: OOD Species Evaluation Workflow
Table 3: Key Research Reagents for Cross-Species Single-Cell Studies
| Reagent/Resource | Function | Example Sources/Platforms |
|---|---|---|
| Cross-Species Cell Atlases | Training data for foundation models | CZ CELLxGENE, Tabula Sapiens, ZebraHub, Human Cell Atlas [9] |
| Orthology Mapping Databases | Gene identifier conversion across species | Ensembl Compara, OrthoDB, HGNC, MGI |
| Single-Cell Foundation Models | Base models for transfer learning | TranscriptFormer, scBERT, scGPT, and other scFMs [10] [9] |
| Adversarial Training Frameworks | Improving model robustness | PyTorch Adversarial, TensorFlow Robustness |
| Contrast Enhancement Networks | Image preprocessing for morphological data | FCE-Net for biomedical optical images [64] |
| Spatial Transcriptomics Data | Adding spatial context to single-cell data | 10X Genomics Visium, MERFISH, seqFISH+ |
| Cell Type Annotation Tools | Reference-based cell labeling | scPred, SingleR, SCINA [16] |
The development of models that generalize effectively to out-of-distribution species represents both a significant challenge and opportunity in computational biology. Current evidence suggests that scale and diversity of training data are crucial factors, with models like TranscriptFormer demonstrating that training on evolutionarily diverse corpora (covering 1.5 billion years of evolution) enables better generalization [9]. However, simply increasing model size or training data may be insufficient if not paired with architectural innovations specifically designed for OOD robustness.
A promising direction is the integration of adversarial training techniques, which have shown benefits for neural predictivity on OOD stimuli despite not improving performance on standard computer vision benchmarks [62]. This suggests that biological relevance requires specialized approaches beyond those developed for general computer vision tasks. Additionally, developing better tokenization strategies for non-sequential biological data remains an active research area, with current approaches including gene ranking by expression, value binning, and incorporation of biological metadata [10].
Future work should focus on standardized benchmarking for cross-species generalization, development of biologically-motivated regularization techniques, and integration of multi-modal data to provide additional constraints that could improve OOD performance. As these models mature, they hold the potential to transform comparative biology and accelerate the development of therapies tested across appropriate model organisms.
Cross-species cell annotation foundation models represent a paradigm shift in computational biology, successfully integrating evolutionary divergence with deep learning to create powerful tools for deciphering cellular function across species. The synthesis of findings reveals that models like GeneCompass, TranscriptFormer, and CAME demonstrate remarkable capabilities in transferring cell type annotations, predicting disease states, and uncovering conserved regulatory mechanisms. Future directions will involve expanding to non-model organisms, integrating multi-omic data, and enhancing model interpretability for clinical translation. For biomedical research, these models promise to accelerate drug target discovery, improve translation from model organisms to humans, and ultimately enable predictive biology across the tree of life, significantly advancing our ability to understand and treat human disease.