From Pixels to Gigapixels: Bridging Local Inductive Bias and Long-Range Dependencies with Pixel-Mamba
Zhongwei Qiu, Hanqing Chao, Tiancheng Lin, Wanxing Chang, Zijiang Yang, Wenpei Jiao, Yixuan Shen, Yunshuo Zhang, Yelin Yang, Wenbin Liu, Hui Jiang, Yun Bian, Ke Yan, Dakai Jin, Le Lu
TL;DR
The paper addresses the challenge of analyzing gigapixel whole slide images (WSIs) by introducing Pixel-Mamba, an end-to-end architecture that combines local inductive bias with long-range dependency modeling through progressive token expansion and a linear-memory state-space backbone (Mamba). Pixel-Mamba serializes a WSI into pixel-level tokens, hierarchically expands their receptive fields, and applies region fusion, which makes end-to-end training efficient and yields strong performance on tumor staging and survival analysis tasks without pathology-specific pretraining. Empirically, Pixel-Mamba matches or surpasses state-of-the-art foundation models pretrained on millions of WSIs or WSI-text pairs, and it also reports competitive results on ImageNet, which supports its use as a practical baseline for WSI analysis. The work highlights the value of hierarchical slide representations and end-to-end optimization for pathology, offering a scalable, memory-efficient alternative to heavy two-stage pipelines and dilated-Transformer approaches, with potential implications for clinical workflows and AI-assisted pathology.
Abstract
Histopathology plays a critical role in medical diagnostics, with whole slide images (WSIs) offering valuable insights that directly influence clinical decision-making. However, the large size and complexity of WSIs pose significant challenges for deep learning models, in both computational efficiency and effective representation learning. In this work, we introduce Pixel-Mamba, a novel deep learning architecture designed to efficiently handle gigapixel WSIs. Pixel-Mamba leverages the Mamba module, a state-space model (SSM) with linear memory complexity, and incorporates local inductive biases through progressively expanding tokens, akin to convolutional neural networks. This enables Pixel-Mamba to hierarchically combine local and global information while keeping computational cost manageable. Remarkably, Pixel-Mamba matches or even surpasses the quantitative performance of state-of-the-art (SOTA) foundation models that were pretrained on millions of WSIs or WSI-text pairs, across a range of tumor staging and survival analysis tasks, even without requiring any pathology-specific pretraining. Extensive experiments demonstrate the efficacy of Pixel-Mamba as a powerful and efficient framework for end-to-end WSI analysis.
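To make the two ideas in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a fixed-parameter diagonal linear state-space recurrence in place of Mamba's input-dependent selective scan, row-major serialization of the token grid, and 2x2 average merging as a stand-in for token expansion; the abstract does not specify any of these details. All function names and parameters (`ssm_scan`, `merge_2x2`, `toy_hierarchy`, `a`, `b`, `c`) are hypothetical.

```python
import numpy as np


def ssm_scan(x, a, b, c):
    """Diagonal linear state-space recurrence over a token sequence.

    x: (L, D) tokens; a, b, c: (D,) per-channel parameters (hypothetical).
    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    Only one state vector of size D is carried between steps, so the
    recurrent state does not grow with the sequence length L.
    """
    h = np.zeros(x.shape[1])
    y = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        y[t] = c * h
    return y


def merge_2x2(grid):
    """Average each non-overlapping 2x2 block: (H, W, D) -> (H/2, W/2, D).

    Averaging is an assumed merge rule, chosen only for illustration.
    """
    H, W, D = grid.shape
    return grid.reshape(H // 2, 2, W // 2, 2, D).mean(axis=(1, 3))


def toy_hierarchy(grid, num_stages, a, b, c):
    """Alternate a sequence scan with token merging, CNN-pyramid style."""
    shapes = [grid.shape[:2]]
    for _ in range(num_stages):
        H, W, D = grid.shape
        seq = grid.reshape(H * W, D)           # serialize the grid row by row
        seq = ssm_scan(seq, a, b, c)           # mix information along the sequence
        grid = merge_2x2(seq.reshape(H, W, D)) # each token now covers a 2x larger region per side
        shapes.append(grid.shape[:2])
    return grid, shapes


rng = np.random.default_rng(0)
D = 4
grid = rng.standard_normal((16, 16, D))        # toy 16x16 grid of D-dimensional tokens
a, b, c = np.full(D, 0.9), np.ones(D), np.ones(D)
out, shapes = toy_hierarchy(grid, 3, a, b, c)
print(shapes)
```

Each stage in this sketch cuts the token count by a factor of four while every remaining token summarizes a larger image region, which is the sense in which token expansion resembles a convolutional feature pyramid. The scan itself carries only a fixed-size state, which is the property behind the linear memory complexity the abstract attributes to the SSM backbone.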
