From Pixels to Gigapixels: Bridging Local Inductive Bias and Long-Range Dependencies with Pixel-Mamba

Zhongwei Qiu, Hanqing Chao, Tiancheng Lin, Wanxing Chang, Zijiang Yang, Wenpei Jiao, Yixuan Shen, Yunshuo Zhang, Yelin Yang, Wenbin Liu, Hui Jiang, Yun Bian, Ke Yan, Dakai Jin, Le Lu

TL;DR

The paper addresses the challenge of analyzing gigapixel whole slide images (WSIs) by introducing Pixel-Mamba, an end-to-end architecture that fuses local inductive bias with long-range dependencies through progressive token expansion and a linear-memory state-space backbone (Mamba). By serializing WSIs into pixel-level tokens, hierarchically expanding their receptive fields, and applying region fusion, Pixel-Mamba trains efficiently end to end and performs strongly on tumor staging and survival analysis tasks without pathology-specific pretraining. Empirical results show Pixel-Mamba matching or surpassing state-of-the-art foundation models pretrained on millions of WSIs or WSI-text pairs; it also reports competitive ImageNet results, underscoring its practicality as a baseline for WSI analysis. The work highlights the value of hierarchical slide representations and end-to-end optimization for pathology, offering a scalable, memory-efficient alternative to heavy two-stage pipelines and dilated-Transformer approaches, with implications for clinical workflows and AI-assisted pathology.

Abstract

Histopathology plays a critical role in medical diagnostics, with whole slide images (WSIs) offering valuable insights that directly influence clinical decision-making. However, the large size and complexity of WSIs pose significant challenges for deep learning models, in both computational efficiency and effective representation learning. In this work, we introduce Pixel-Mamba, a novel deep learning architecture designed to efficiently handle gigapixel WSIs. Pixel-Mamba leverages the Mamba module, a state-space model (SSM) with linear memory complexity, and incorporates local inductive biases through progressively expanding tokens, akin to convolutional neural networks. This enables Pixel-Mamba to hierarchically combine both local and global information while efficiently addressing computational challenges. Remarkably, Pixel-Mamba matches or even surpasses the quantitative performance of state-of-the-art (SOTA) foundation models that were pretrained on millions of WSIs or WSI-text pairs, in a range of tumor staging and survival analysis tasks, even without requiring any pathology-specific pretraining. Extensive experiments demonstrate the efficacy of Pixel-Mamba as a powerful and efficient framework for end-to-end WSI analysis.
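
To make the "linear memory complexity" claim concrete, the following is a minimal sketch of a plain scalar state-space recurrence. It is an illustration only, not the Pixel-Mamba or Mamba implementation: Mamba uses input-dependent (selective) parameters and a hardware-aware scan, while this sketch assumes fixed scalars. The function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Scalar linear state-space recurrence (illustrative assumption, not Mamba itself).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    h = 0.0          # the only state carried between steps
    ys = []
    for x in xs:     # one pass over the token sequence: time grows linearly with length
        h = a * h + b * x
        ys.append(c * h)
    return ys


if __name__ == "__main__":
    # An impulse at the first token decays geometrically through the state.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5))
```

The recurrence carries a fixed-size state from token to token, so the working memory besides the outputs does not grow with sequence length. Self-attention, by contrast, compares every token with every other token.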

Paper Structure

This paper contains 37 sections, 2 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: (a) Pathologists integrate observations from multiple regions across different scales to make a comprehensive assessment. (b) Frameworks of mainstream WSI analysis methods: a two-stage pipeline (top) and memory-optimized ViT (bottom, often with heavily pruned attention). (c) The proposed Pixel-Mamba, an end-to-end framework that combines progressive token expansion and the Mamba module to effectively integrate local inductive biases with long-range dependencies in a hierarchical manner.
  • Figure 2: The Pixel-Mamba Framework. (a) The WSI is serialized, with CLS tokens added to create the token series $T$. (b) Pixel-Mamba progressively expands the receptive field of tokens while maintaining global context modeling. (c) Detailed illustrations of the Mamba Block, Region Fusion, and Token Expansion modules.
  • Figure 3: Illustration of Token Expansion within a region.
  • Figure 4: Comparison of Kaplan-Meier analysis and log-rank test results on the BLCA dataset (lower p-values are better; a p-value $\leq$ 0.01 indicates a statistically significant difference between the two groups).
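
The log-rank test mentioned in the Figure 4 caption can be sketched as follows. This is a minimal standard-library version of the standard two-group log-rank statistic, not the authors' evaluation code. The function name `logrank_test` and the toy survival times are hypothetical and are not data from the paper.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times_*  : follow-up times
    events_* : 1 if the event (death) was observed, 0 if censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Subjects still at risk just before time t, per group.
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)
        # Events occurring exactly at time t.
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b

        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group-A event count at time t.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    stat = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # Toy example: a high-risk group with early events vs. a low-risk group.
    high = [1, 2, 3, 4, 5]
    low = [10, 11, 12, 13, 14]
    stat, p = logrank_test(high, [1] * 5, low, [1] * 5)
    print(f"chi2={stat:.2f}, p={p:.4f}")
```

In a survival-analysis setting, patients are typically split into high- and low-risk groups by predicted risk score, and a smaller p-value indicates clearer separation between the two Kaplan-Meier curves.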