Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity

Cong Qi, Hanzhang Fang, Tianxing Hu, Siqi Jiang, Wei Zhi

TL;DR

GeneMamba introduces a bidirectional Mamba-based foundation model for single-cell RNA-seq that achieves linear-time context learning and scales to ultra-long gene expression sequences. By combining a Rank Module for tokenization, a Bi-Mamba backbone with forward and reverse processing, and a joint language-like and pathway-aware pretraining objective, the model demonstrates strong performance in multi-batch integration, cell type annotation, and gene-gene relationship tasks. Its results show improved batch mixing, robust cell-type classification, and enhanced conservation of biological structure, with interpretability advantages over transformer baselines. The work highlights practical, scalable tooling for large-scale single-cell analysis, while noting computational resource requirements and avenues for future efficiency improvements.

Abstract

Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but its complexity, marked by high dimensionality, sparsity, and batch effects, poses major computational challenges. Transformer-based models have made significant advances in this domain but are often limited by their quadratic complexity and suboptimal handling of long-range dependencies. In this work, we introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. Leveraging the Bi-Mamba architecture, GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. The model is pretrained on nearly 30 million cells and incorporates biologically informed objectives, including pathway-aware contrastive loss and rank-based gene encoding. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness. These results position GeneMamba as a practical and powerful alternative to transformer-based methods, advancing the development of biologically grounded, scalable tools for large-scale single-cell data analysis.
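
The rank-based gene encoding mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common convention of ordering a cell's expressed genes from highest to lowest expression and emitting their vocabulary ids as the input sequence. The function name `rank_tokenize`, the toy vocabulary, the zero-expression filter, and the `max_len` truncation are all hypothetical choices for illustration; the paper's actual Rank Module may normalize and order genes differently.

```python
from typing import Dict, List, Sequence


def rank_tokenize(
    expression: Sequence[float],
    gene_ids: Sequence[str],
    vocab: Dict[str, int],
    max_len: int,
) -> List[int]:
    """Turn one cell's expression vector into a rank-ordered token sequence.

    Assumption: genes with zero expression are dropped and the remaining
    genes are sorted by descending expression (ties keep original order).
    """
    if len(expression) != len(gene_ids):
        raise ValueError("expression and gene_ids must have the same length")

    # Keep only expressed genes, remembering their original position.
    expressed = [(value, idx) for idx, value in enumerate(expression) if value > 0]

    # Highest expression first; the index breaks ties deterministically.
    expressed.sort(key=lambda pair: (-pair[0], pair[1]))

    # Truncate to the model's context length and map gene names to token ids.
    return [vocab[gene_ids[idx]] for _, idx in expressed[:max_len]]


# Toy example: four genes, token id 0 is reserved for padding (an assumption).
genes = ["CD3E", "MS4A1", "LYZ", "NKG7"]
vocab = {gene: i + 1 for i, gene in enumerate(genes)}
cell = [2.0, 0.0, 5.5, 0.7]

tokens = rank_tokenize(cell, genes, vocab, max_len=3)
print(tokens)
```

In this sketch the sequence carries only the relative ordering of genes, not their raw counts, which is the usual motivation for rank encodings when batches differ in sequencing depth.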

Paper Structure

This paper contains 26 sections, 31 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: The GeneMamba architecture and its downstream task applications. The framework begins with the collection of training data (approximately 50M cells) from CELLXGENE, encompassing a diverse array of tissues and organs. After preprocessing, the data is prepared through a Gene Rank Module to transform single-cell data into input sequences. The GeneMamba module then captures the contextual information within each single cell. Once pretrained, the model and its embeddings are employed for various downstream tasks to evaluate the model's performance.
  • Figure 2: Schematic overview of the BiMamba Block. The BiMamba Block processes input sequences bidirectionally, capturing forward and reverse context through shared convolutional layers (Conv) and state space models (SSM). A gating mechanism integrates the outputs, followed by linear projection and nonlinearity layers, generating a context-aware representation for downstream tasks.
  • Figure 3: Results of multi-batch integration. Benchmark of the fine-tuned GeneMamba on the PBMC 12k dataset for the multi-batch integration task. The UMAP plot of learned cell embeddings is colored by cell types.
  • Figure 4: Gene rank reconstruction results on the PBMC12k dataset. (a) Venn diagrams showing the overlap between input and output tokens in the pancreas dataset for three models: GeneMamba_U (unidirectional Mamba module as backbone), GeneFormer, and GeneMamba (BiMamba module as backbone). (b) Density plots showing input and output rankings in the pancreas dataset for the same three models: GeneMamba_U, GeneFormer, and GeneMamba.
  • Figure 5: Cell type distribution in the original and modified Myeloid datasets
  • ...and 10 more figures
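
The bidirectional processing described in the Figure 2 caption can be sketched with a toy example. This is not the Mamba selective scan: it substitutes a fixed scalar linear recurrence for the SSM, omits the convolution and projection layers, and uses a simple sigmoid gate on the input. The names `linear_scan` and `bidirectional_scan` and the parameters `decay` and `input_gain` are hypothetical. What the sketch does show is the general pattern: one set of parameters is applied to the sequence and to its reverse, each pass costs one linear-time sweep, and the two outputs are combined so that every position sees context from both sides.

```python
import math
from typing import List, Sequence


def linear_scan(x: Sequence[float], decay: float, input_gain: float) -> List[float]:
    """Causal recurrence h_t = decay * h_{t-1} + input_gain * x_t in one O(L) pass."""
    state = 0.0
    outputs = []
    for value in x:
        state = decay * state + input_gain * value
        outputs.append(state)
    return outputs


def bidirectional_scan(
    x: Sequence[float], decay: float = 0.5, input_gain: float = 1.0
) -> List[float]:
    """Run the same recurrence forward and backward, then gate the sum.

    Assumption: the two directions share parameters and are combined by
    addition followed by an input-dependent sigmoid gate.
    """
    forward = linear_scan(x, decay, input_gain)

    # Reverse the input, scan it, and flip the result back into original order.
    backward = linear_scan(x[::-1], decay, input_gain)[::-1]

    gates = [1.0 / (1.0 + math.exp(-value)) for value in x]
    return [g * (f + b) for g, f, b in zip(gates, forward, backward)]


# The only nonzero token is at the end of the sequence.
x = [0.0, 0.0, 1.0]
forward_only = linear_scan(x, 0.5, 1.0)
both_directions = bidirectional_scan(x)
print(forward_only)
print(both_directions)
```

With the forward pass alone, the first two positions receive no signal from the final token. After adding the reverse pass, they do, which is the property a bidirectional block is meant to provide for gene sequences that have no natural left-to-right order.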