MoEMambaMIL: Structure-Aware Selective State Space Modeling for Whole-Slide Image Analysis

Dongqing Xie; Yonghuang Wu

TL;DR

MoEMambaMIL is a structure-aware SSM framework for WSI analysis that integrates region-nested selective scanning with mixture-of-experts (MoE) modeling. It organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions.

Abstract

Whole-slide image (WSI) analysis is challenging due to the gigapixel scale of slides and their inherent hierarchical multi-resolution structure. Existing multiple instance learning (MIL) approaches often model WSIs as unordered collections of patches, which limits their ability to capture structured dependencies between global tissue organization and local cellular patterns. Although recent State Space Models (SSMs) enable efficient modeling of long sequences, how to structure WSI tokens to fully exploit their spatial hierarchy remains an open problem. We propose MoEMambaMIL, a structure-aware SSM framework for WSI analysis that integrates region-nested selective scanning with mixture-of-experts (MoE) modeling. Leveraging multi-resolution preprocessing, MoEMambaMIL organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions. On top of this structured sequence, we decouple resolution-aware encoding and region-adaptive contextual modeling via a combination of static, resolution-specific experts and dynamic sparse experts with learned routing. This design enables efficient long-sequence modeling while promoting expert specialization across heterogeneous diagnostic patterns. Experiments demonstrate that MoEMambaMIL achieves the best performance across 9 downstream tasks.
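The idea of a region-nested sequence that preserves spatial containment can be illustrated with a small sketch. This is not the authors' implementation: it assumes a quadtree layout in which each coarse patch contains a 2x2 block of patches at the next finer depth, and the names `region_nested_order`, `resolution_order`, and the `(depth, row, col)` token format are hypothetical. Here `depth` 0 denotes the coarsest grid in this sketch, which need not match the paper's level numbering.

```python
from typing import List, Tuple

# Hypothetical token identifier: (depth, row, col); depth 0 is the coarsest grid here.
Token = Tuple[int, int, int]


def region_nested_order(rows: int, cols: int, depths: int) -> List[Token]:
    """Depth-first ordering: each coarse patch is immediately followed by
    the finer patches it spatially contains (assumed 2x2 children)."""
    order: List[Token] = []

    def visit(depth: int, r: int, c: int) -> None:
        order.append((depth, r, c))
        if depth + 1 < depths:
            for dr in (0, 1):
                for dc in (0, 1):
                    visit(depth + 1, 2 * r + dr, 2 * c + dc)

    for r in range(rows):
        for c in range(cols):
            visit(0, r, c)
    return order


def resolution_order(rows: int, cols: int, depths: int) -> List[Token]:
    """Baseline ordering: all tokens of one depth in raster order, then the next depth."""
    return [
        (d, r, c)
        for d in range(depths)
        for r in range(rows * 2 ** d)
        for c in range(cols * 2 ** d)
    ]


if __name__ == "__main__":
    # One coarse patch with two depths: the parent, then its four children.
    print(region_nested_order(1, 1, 2))
```

Both orderings contain the same tokens; they differ only in whether a region's finer patches sit next to their parent in the sequence, which is the property a sequential scan can exploit.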

Paper Structure (40 sections, 26 equations, 6 figures, 6 tables, 2 algorithms)

Introduction
Related Work
Multiple Instance Learning for Whole-Slide Images
State Space Models in Vision
Mixture-of-Experts Architectures
Method
Problem Definition
Region-Nested Selective Scan
Static and Dynamic Experts for Conditional Computation
Static Experts for Multi-Resolution Encoding
Dynamic Experts for Region-Aware Modeling
Load Balancing Regularization
Training Objective
Experiments
Dataset and Implementations
...and 25 more sections

Figures (6)

Figure 1: Conceptual comparison between conventional MIL-based WSI modeling and the proposed MoEMambaMIL framework.
Figure 2: Overview of the proposed framework. Multi-resolution WSI patches are first organized into resolution-aware sequences and modeled by static experts to capture structural representations. A selective scan strategy is then applied to construct region-nested scans, yielding region-nested token sequences. These sequences are processed by a MoEMamba backbone with gating and routing mechanisms that dynamically dispatch tokens to Mamba experts. Finally, an attention-based MIL head aggregates token features for WSI-level prediction.
Figure 3: Performance differences ($\Delta = n - r$) between resolution-based (r) and region-nested (n) selective scanning across datasets and models. The mixed signs across metrics indicate complementary strengths of the two schemes.
Figure 4: Ablation study across different encoders (ResNet, UNI, Gigapath): Mean Performance Across Metrics.
Figure 5: Multi-resolution MIL attention visualization. For each WSI, instance-level attention weights derived from the final MIL pooling are visualized at three resolution levels (Level 0–2). Red indicates high attention, blue indicates low attention, and white denotes intermediate importance.
...and 1 more figure
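The gating and routing step described in the Figure 2 caption, together with the load-balancing regularizer named in the section outline, can be sketched generically. The exact formulation used by MoEMambaMIL is not given on this page, so the code below assumes standard softmax top-k gating and a Switch-Transformer-style auxiliary loss as a stand-in; `top_k_route` and `load_balance_loss` are hypothetical names.

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert index, renormalized weight) for the k highest-scoring experts."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in chosen)
    return [(i, probs[i] / total) for i in chosen]


def load_balance_loss(all_logits: List[List[float]], k: int) -> float:
    """Assumed Switch-style auxiliary loss: n_experts * sum_e f_e * p_e, where
    f_e is the fraction of routing slots assigned to expert e and
    p_e is the mean router probability of expert e over all tokens."""
    n_tokens = len(all_logits)
    n_experts = len(all_logits[0])
    frac = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in all_logits:
        probs = softmax(logits)
        for i, _ in top_k_route(logits, k):
            frac[i] += 1.0 / (n_tokens * k)
        for e in range(n_experts):
            mean_prob[e] += probs[e] / n_tokens
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))


if __name__ == "__main__":
    balanced = [[1.0, 0.0], [0.0, 1.0]]    # tokens split evenly across experts
    collapsed = [[2.0, 0.0], [2.0, 0.0]]   # every token prefers expert 0
    print(load_balance_loss(balanced, k=1), load_balance_loss(collapsed, k=1))
```

Under this assumed loss, evenly spread routing scores about 1.0 and routing that collapses onto one expert scores higher, so minimizing it pushes the router toward using all experts.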
