scMamba: A Pre-Trained Model for Single-Nucleus RNA Sequencing Analysis in Neurodegenerative Disorders

Gyutaek Oh, Baekgyu Choi, Seyoung Jin, Inkyung Jung, Jong Chul Ye

TL;DR

scMamba introduces a pre-trained snRNA-seq analysis model for neurodegenerative disorders that leverages selective state-space Mamba blocks and gene-aware embeddings to process full-length gene expression data without HVG reduction. Through masked expression modeling (MEM) during pre-training, the model learns generalizable cell- and gene-level representations, enabling effective fine-tuning for cell type classification, doublet detection, and imputation, with improved robustness in differential expression analyses. Across 14+ datasets and the Jung cohort, scMamba consistently outperforms baselines in fine-grained cell-type tasks, demonstrates strong imputation quality and batch-correction capabilities, and enhances DEG reproducibility in heterogeneous, postmortem brain data. This approach supports scalable, information-preserving integration of snRNA-seq data for neurodegenerative disease research and large-scale discovery.
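The masking step behind masked expression modeling can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names, the `mask_value` placeholder, and the mean-squared-error loss are assumptions made for illustration, since this summary does not state them. The two masking rates reflect the Figure 1 caption, which notes that zero and non-zero values are masked with different probabilities during imputation fine-tuning; the actual rates are not given here.

```python
import numpy as np

def mask_expression(x, p_zero, p_nonzero, mask_value=-1.0, rng=None):
    """Mask one cell's expression vector; return (masked copy, boolean mask).

    p_zero / p_nonzero are hypothetical masking probabilities for zero and
    non-zero entries. mask_value is an assumed placeholder for masked slots.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # Each gene gets a masking probability based on whether its value is zero.
    probs = np.where(x == 0, p_zero, p_nonzero)
    mask = rng.random(x.shape) < probs
    masked = x.copy()
    masked[mask] = mask_value
    return masked, mask

def masked_mse(pred, target, mask):
    """Mean squared error over masked positions only (assumed loss)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if not mask.any():
        return 0.0
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))
```

In a training loop, the masked vector would be fed to the model and `masked_mse` would compare the model's predictions with the original values at the masked positions only.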

Abstract

Single-nucleus RNA sequencing (snRNA-seq) has significantly advanced our understanding of the disease etiology of neurodegenerative disorders. However, the low quality of specimens derived from postmortem brain tissues, combined with the high variability caused by disease heterogeneity, makes it challenging to integrate snRNA-seq data from multiple sources for precise analyses. To address these challenges, we present scMamba, a pre-trained model designed to improve the quality and utility of snRNA-seq analysis, with a particular focus on neurodegenerative diseases. Inspired by the recent Mamba model, scMamba introduces a novel architecture that incorporates a linear adapter layer, gene embeddings, and bidirectional Mamba blocks, enabling efficient processing of snRNA-seq data while preserving information from the raw input. Notably, scMamba learns generalizable features of cells and genes through pre-training on snRNA-seq data, without relying on dimension reduction or selection of highly variable genes. We demonstrate that scMamba outperforms benchmark methods in various downstream tasks, including cell type annotation, doublet detection, imputation, and the identification of differentially expressed genes.

Paper Structure

This paper contains 16 sections, 4 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overall framework of scMamba. a. The scMamba model comprises a linear layer for expression embeddings, gene embeddings, and bidirectional Mamba blocks. During pre-training, a subset of the input data is masked, and the model predicts the expression levels at the masked positions. b. For fine-tuning on classification tasks, three [CLS] embeddings are inserted into the input embeddings. These embeddings are processed through the Mamba blocks and then passed to a classification head, which predicts cell classes. c. For fine-tuning on the snRNA-seq imputation task, portions of the input expression levels are masked, and the model predicts the masked values. Zero and non-zero values are masked with different probabilities to ensure balanced learning.
  • Figure 1: Boxplots showing the fraction of DEG overlap for six cell types after selecting half of the samples within the same study. The number of cells for each cell type is also shown: Astrocytes = 5,018, Oligodendrocytes = 20,956, OPCs = 2,674, Inhibitory neurons = 705, Endothelial cells = 1,641, and Pericytes = 673. P-values were calculated using a paired t-test.
  • Figure 2: a. To generate cell embeddings from the pre-trained model, snRNA-seq data are input into the model, and the resulting output features are averaged along the sequence length. b. UMAP visualization of cell embeddings from the pre-trained scMamba model. Each UMAP is colored by the 8 major cell types or the 72 subtypes. c. UMAP visualization of gene embeddings from the pre-trained scMamba model. Marker genes of 4 distinct cell types are labeled by name. (AC: astrocyte, MG: microglia, OL: oligodendrocyte, OPC: oligodendrocyte progenitor cell, EXN: excitatory neuron, INN: inhibitory neuron, EC: endothelial cell, PC: pericyte, NEU: neuron).
  • Figure 3: a. The dataset is labeled with 8 major cell types and 72 detailed subtypes. b. F1 score distribution across 8 cell types, with each box plot representing results for individual datasets. c. F1 score distribution across 72 subtypes, with each box plot representing results for individual datasets. d. F1 score distribution across 127 subclusters, with each box plot representing results for individual datasets.
  • Figure 4: a. Simulated doublets are generated by averaging the UMI counts of two randomly selected singlets. b. Heatmap of evaluation metric scores for in vivo doublet detection by each method across datasets. White squares indicate where the method failed to execute. c. Heatmap of evaluation metric scores for simulated doublet detection by each method across datasets.
  • ...and 4 more figures
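
Two procedures described in the captions above are simple enough to sketch in NumPy: averaging output features along the sequence length to obtain a cell embedding (Figure 2a), and generating simulated doublets by averaging the UMI counts of two randomly selected singlets (Figure 4a). This is a minimal illustration under stated assumptions, not the authors' code. The function names, array layouts, and the choice to always pair two distinct cells are hypothetical.

```python
import numpy as np

def mean_pool(token_features):
    """Average per-gene output features along the sequence axis.

    Assumed layout: (n_genes, dim) or (batch, n_genes, dim).
    Returns (dim,) or (batch, dim).
    """
    return np.asarray(token_features, dtype=float).mean(axis=-2)

def simulate_doublets(counts, n_doublets, rng=None):
    """Create simulated doublets from a (n_cells, n_genes) singlet count matrix.

    Each doublet is the element-wise average of two distinct, randomly
    chosen singlets. Returns the doublet matrix and the two index arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    if n_cells < 2:
        raise ValueError("need at least two singlets to form a doublet")
    first = rng.integers(0, n_cells, size=n_doublets)
    # A non-zero offset guarantees the second cell differs from the first.
    offset = rng.integers(1, n_cells, size=n_doublets)
    second = (first + offset) % n_cells
    doublets = (counts[first] + counts[second]) / 2.0
    return doublets, first, second
```

Averaging can produce non-integer counts. Whether and how the values are rounded afterwards is not stated in this summary, so the sketch leaves them as floats.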