scMamba: A Pre-Trained Model for Single-Nucleus RNA Sequencing Analysis in Neurodegenerative Disorders
Gyutaek Oh, Baekgyu Choi, Seyoung Jin, Inkyung Jung, Jong Chul Ye
TL;DR
scMamba is a pre-trained model for snRNA-seq analysis in neurodegenerative disorders. It combines selective state-space (Mamba) blocks with gene-aware embeddings to process full-length gene expression profiles without first reducing them to highly variable genes (HVGs). Through masked expression modeling (MEM) during pre-training, the model learns generalizable cell- and gene-level representations, enabling effective fine-tuning for cell type classification, doublet detection, and imputation, with improved robustness in differential expression analyses. Across 14+ datasets and the Jung cohort, scMamba consistently outperforms baselines in fine-grained cell-type tasks, shows strong imputation quality and batch-correction capabilities, and improves the reproducibility of differentially expressed genes (DEGs) in heterogeneous postmortem brain data. The approach supports scalable, information-preserving integration of snRNA-seq data for neurodegenerative disease research and large-scale discovery.
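The masked expression modeling objective mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the mask ratio of 0.15, the sentinel `MASK_VALUE`, the mean-squared-error loss, and the helper names `mask_expression` and `masked_mse` are all assumptions chosen for illustration, and the toy expression values are invented. The sketch only shows the general idea of hiding some expression values and scoring a prediction on the hidden positions alone.

```python
import random

MASK_VALUE = -1.0  # hypothetical sentinel; the source does not specify one


def mask_expression(expr, mask_ratio=0.15, seed=0):
    """Hide a random subset of gene expression values.

    Returns the masked vector and the sorted indices that were hidden.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(expr) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(expr)), n_masked))
    masked = list(expr)
    for i in masked_idx:
        masked[i] = MASK_VALUE
    return masked, masked_idx


def masked_mse(pred, target, masked_idx):
    """Mean squared error computed only over the masked positions."""
    return sum((pred[i] - target[i]) ** 2 for i in masked_idx) / len(masked_idx)


# Toy cell with 20 genes (expression values invented for illustration).
cell = [0.0, 1.2, 0.0, 3.4, 0.7, 0.0, 2.1, 0.0, 0.0, 1.9,
        0.3, 0.0, 4.0, 0.0, 1.1, 0.0, 0.0, 2.6, 0.5, 0.0]
masked_cell, idx = mask_expression(cell, mask_ratio=0.15)

# Stand-in "model": predict the mean of the visible values for every gene.
visible = [v for i, v in enumerate(cell) if i not in idx]
baseline = sum(visible) / len(visible)
pred = [baseline] * len(cell)
loss = masked_mse(pred, cell, idx)
```

In an actual pre-training loop, the constant baseline would be replaced by the network's output for the masked input, and the loss would be minimized over many cells.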
Abstract
Single-nucleus RNA sequencing (snRNA-seq) has significantly advanced our understanding of the etiology of neurodegenerative disorders. However, the low quality of specimens derived from postmortem brain tissues, combined with the high variability caused by disease heterogeneity, makes it challenging to integrate snRNA-seq data from multiple sources for precise analyses. To address these challenges, we present scMamba, a pre-trained model designed to improve the quality and utility of snRNA-seq analysis, with a particular focus on neurodegenerative diseases. Inspired by the recent Mamba model, scMamba introduces a novel architecture that incorporates a linear adapter layer, gene embeddings, and bidirectional Mamba blocks, enabling efficient processing of snRNA-seq data while preserving information from the raw input. Notably, scMamba learns generalizable features of cells and genes through pre-training on snRNA-seq data, without relying on dimensionality reduction or selection of highly variable genes. We demonstrate that scMamba outperforms benchmark methods in various downstream tasks, including cell type annotation, doublet detection, imputation, and the identification of differentially expressed genes.
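To give a rough sense of what "bidirectional" means for a state-space block, the sketch below runs a simple linear recurrence over a sequence in both directions and sums the two passes, so each position aggregates information from both sides. It is a deliberate simplification under stated assumptions: the coefficients `a` and `b` are fixed scalars, whereas Mamba makes its state-space parameters depend on the input; the linear adapter and gene embeddings are omitted; and summing the two directions is one possible way to combine them, not necessarily the one used in scMamba. The function names are hypothetical.

```python
def linear_scan(xs, a=0.9, b=0.1):
    """Scalar linear recurrence h_t = a * h_{t-1} + b * x_t, starting from h = 0."""
    h = 0.0
    out = []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def bidirectional_scan(xs, a=0.9, b=0.1):
    """Combine a left-to-right scan with a right-to-left scan.

    The backward pass scans the reversed sequence and is flipped back so that
    position i of both passes refers to the same input element. The two passes
    are summed here (an assumption made for illustration).
    """
    forward = linear_scan(xs, a, b)
    backward = linear_scan(xs[::-1], a, b)[::-1]
    return [f + r for f, r in zip(forward, backward)]


# With a single nonzero input at position 0, a forward-only scan propagates
# the signal to later positions; the backward pass contributes at position 0.
impulse = [1.0, 0.0, 0.0]
forward_only = linear_scan(impulse)
both = bidirectional_scan(impulse)
```

A forward-only scan lets position i depend only on positions 0 through i, while the bidirectional variant also lets it depend on positions after i.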
