
MoEMambaMIL: Structure-Aware Selective State Space Modeling for Whole-Slide Image Analysis

Dongqing Xie, Yonghuang Wu

TL;DR

MoEMambaMIL is a structure-aware SSM framework for WSI analysis. It integrates region-nested selective scanning with mixture-of-experts (MoE) modeling and organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions.

Abstract

Whole-slide image (WSI) analysis is challenging due to the gigapixel scale of slides and their inherent hierarchical multi-resolution structure. Existing multiple instance learning (MIL) approaches often model WSIs as unordered collections of patches, which limits their ability to capture structured dependencies between global tissue organization and local cellular patterns. Although recent State Space Models (SSMs) enable efficient modeling of long sequences, how to structure WSI tokens to fully exploit their spatial hierarchy remains an open problem. We propose MoEMambaMIL, a structure-aware SSM framework for WSI analysis that integrates region-nested selective scanning with mixture-of-experts (MoE) modeling. Leveraging multi-resolution preprocessing, MoEMambaMIL organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions. On top of this structured sequence, we decouple resolution-aware encoding and region-adaptive contextual modeling via a combination of static, resolution-specific experts and dynamic sparse experts with learned routing. This design enables efficient long-sequence modeling while promoting expert specialization across heterogeneous diagnostic patterns. Experiments demonstrate that MoEMambaMIL achieves the best performance across 9 downstream tasks.
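
To make the idea of a region-aware sequence that preserves spatial containment more concrete, here is a minimal sketch of one possible token ordering. It is not the authors' implementation. It assumes a quadtree-like pyramid in which every patch contains a `factor` x `factor` block of patches at the next finer resolution, and it uses a hypothetical `depth` index where depth 0 is the coarsest grid. The function name `region_nested_order` and all of its parameters are invented for illustration.

```python
from typing import List, Tuple

# (depth, row, col); depth 0 is the coarsest grid in this sketch (an assumption).
Token = Tuple[int, int, int]


def region_nested_order(rows: int, cols: int, max_depth: int, factor: int = 2) -> List[Token]:
    """Return patch tokens in depth-first order.

    Each coarse patch is emitted first and is immediately followed by all
    finer patches that lie inside it, so containment is preserved in the
    resulting 1D sequence.
    """
    order: List[Token] = []

    def visit(depth: int, r: int, c: int) -> None:
        order.append((depth, r, c))
        if depth == max_depth:
            return
        # Children of (r, c) occupy a factor x factor block one level finer.
        for dr in range(factor):
            for dc in range(factor):
                visit(depth + 1, r * factor + dr, c * factor + dc)

    # Walk the coarsest grid in raster order and expand each region fully.
    for r in range(rows):
        for c in range(cols):
            visit(0, r, c)
    return order


if __name__ == "__main__":
    for token in region_nested_order(rows=1, cols=1, max_depth=1):
        print(token)
```

With this ordering, the tokens of one tissue region at every resolution sit in a contiguous span of the sequence, which is the property a sequential model such as an SSM could exploit. Real slides would also need background filtering and irregular region shapes, which this sketch ignores.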

Paper Structure (40 sections, 26 equations, 6 figures, 6 tables, 2 algorithms)

Figures (6)

  • Figure 1: Conceptual comparison between conventional MIL-based WSI modeling and the proposed MoEMambaMIL framework.
  • Figure 2: Overview of the proposed framework. Multi-resolution WSI patches are first organized into resolution-aware sequences and modeled by static experts to capture structural representations. A selective scan strategy is then applied to construct region-nested scans, yielding region-nested token sequences. These sequences are processed by a MoEMamba backbone with gating and routing mechanisms that dynamically dispatch tokens to Mamba experts. Finally, an attention-based MIL head aggregates token features for WSI-level prediction.
  • Figure 3: Performance differences ($\Delta = n - r$) between resolution-based (r) and region-nested (n) selective scanning across datasets and models. The mixed signs across metrics indicate complementary strengths of the two schemes.
  • Figure 4: Ablation study across different encoders (ResNet, UNI, Gigapath), reporting mean performance across metrics.
  • Figure 5: Multi-resolution MIL attention visualization. For each WSI, instance-level attention weights derived from the final MIL pooling are visualized at three resolution levels (Level 0–2). Red indicates high attention, blue indicates low attention, and white denotes intermediate importance.
  • ...and 1 more figure
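
The gating, routing, and attention-based aggregation steps named in the Figure 2 caption can be illustrated with a small sketch. It is written under several assumptions and is not the paper's code: the experts here are plain Python callables rather than Mamba blocks, the gate logits and attention scores are passed in directly instead of being produced by learned layers, and the names `top_k_route`, `moe_token`, and `attention_pool` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Select the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_token(
    token: Vector,
    gate_logits: Sequence[float],
    experts: Sequence[Callable[[Vector], Vector]],
    k: int = 2,
) -> Vector:
    """Sparse dispatch: only the selected experts process the token."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(gate_logits, k):
        expert_out = experts[idx](token)  # stand-in for a Mamba expert (assumption)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


def attention_pool(features: Sequence[Vector], scores: Sequence[float]) -> Vector:
    """Attention-based MIL pooling: softmax-weighted sum of token features."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
```

In a full model the pooled vector would feed a slide-level classifier, and the softmax weights inside `attention_pool` correspond to the kind of instance-level attention that Figure 5 visualizes.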