Exploiting Label-Independent Regularization from Spatial Dependencies for Whole Slide Image Analysis

Weiyi Wu, Xinwen Xu, Chongyang Gao, Xingjian Diao, Siting Li, Jiang Gui

TL;DR

This work proposes a spatially regularized multiple instance learning (MIL) framework that uses the inherent spatial relationships among patch features as label-independent regularization signals. It learns a shared representation space by jointly optimizing a feature-induced spatial reconstruction objective and a label-guided classification objective.

Abstract

Whole slide images (WSIs), with their gigapixel-scale panoramas of tissue samples, are pivotal for precise disease diagnosis. However, their analysis is hindered by immense data size and scarce annotations. Existing multiple instance learning (MIL) methods are challenged by a fundamental imbalance: a single bag-level label must guide the learning of numerous patch-level features. This sparse supervision makes it difficult to reliably identify discriminative patches during training, leading to unstable optimization and suboptimal solutions. We propose a spatially regularized MIL framework that leverages inherent spatial relationships among patch features as label-independent regularization signals. Our approach learns a shared representation space by jointly optimizing feature-induced spatial reconstruction and label-guided classification objectives, enforcing consistency between intrinsic structural patterns and supervisory signals. Experimental results on multiple public datasets demonstrate significant improvements over state-of-the-art methods, offering a promising direction for WSI analysis under sparse supervision.
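The joint objective described in the abstract can be sketched as a forward pass that adds a masked-reconstruction loss to a bag-level classification loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the k-nearest-neighbor mean aggregation stands in for the graph attention networks used in the paper, zeroing stands in for whatever mask token the model uses, and the names `knn_adjacency`, `encode`, `joint_loss`, `lam`, and the parameter keys are all hypothetical.

```python
import numpy as np

def knn_adjacency(coords, k=4):
    """Row-normalized adjacency linking each patch to itself and its k nearest spatial neighbors."""
    n = coords.shape[0]
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    nearest = np.argsort(dist, axis=1)[:, : k + 1]  # column 0 is the patch itself
    adj = np.zeros((n, n))
    adj[np.arange(n)[:, None], nearest] = 1.0
    return adj / adj.sum(axis=1, keepdims=True)

def encode(feats, adj, w_enc):
    """Shared encoder: neighbor averaging plus one linear layer (assumed stand-in for a GAT)."""
    return np.tanh(adj @ feats @ w_enc)

def joint_loss(feats, coords, label, params, mask_ratio=0.5, lam=1.0, seed=0):
    """Return (total, classification, reconstruction, attention) for one bag."""
    rng = np.random.default_rng(seed)
    n = feats.shape[0]
    adj = knn_adjacency(coords)

    # Feature-induced stream: hide random patches, rebuild them from spatial neighbors.
    n_mask = max(1, int(mask_ratio * n))
    masked = rng.choice(n, size=n_mask, replace=False)
    corrupted = feats.copy()
    corrupted[masked] = 0.0  # assumption: zeros act as the mask token
    recon = encode(corrupted, adj, params["w_enc"]) @ params["w_dec"]
    loss_rec = float(np.mean((recon[masked] - feats[masked]) ** 2))

    # Label-guided stream: attention pooling over the same encoder, then a linear classifier.
    h = encode(feats, adj, params["w_enc"])
    scores = h @ params["w_attn"]
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    logit = float(attn @ h @ params["w_cls"])
    loss_cls = float(np.logaddexp(0.0, logit) - label * logit)  # binary cross-entropy on the logit

    return loss_cls + lam * loss_rec, loss_cls, loss_rec, attn
```

The reconstruction term never touches the bag label, which is what makes it a label-independent signal: it constrains the shared encoder even when the classification gradient is dominated by a few patches. How the two terms are actually weighted in the paper is not stated in this summary, so `lam` is only a placeholder.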

Paper Structure (27 sections, 12 equations, 5 figures, 4 tables, 2 algorithms)

Figures (5)

  • Figure 1: Illustration of regularization strategies in MIL. In the attention maps, orange marks high-attention instances. In the classification diagrams, red and green represent the target and non-target classes, respectively, and the dense dashed lines represent the decision boundary. Circles indicate bag features, squares denote instance features, and dashed circles represent bag features after dropout. (1) Enforces label consistency between high-attention instances and bags. (2) Stochastically drops instances based on attention scores. Both (1) and (2) are label-driven: they iteratively update the model based on classification, and the updated model generates new attention maps for the next iteration. (3) Regularizes the feature space through masked feature reconstruction, offering label-independent, noise-free regularization.
  • Figure 2: Overview of the SRMIL framework. The model employs two learning streams: a feature-induced stream trained with self-supervised reconstruction and a label-guided classification stream. The encoder-decoder architecture processes WSI patch embeddings through graph attention networks; reconstructing randomly masked patch features serves as noise-free regularization that complements the supervised learning objective. Black lines indicate information flow from input patches through both learning streams.
  • Figure 3: Behavior analysis of the ABMIL model on the training set over the course of training. (a) Training accuracy curve. (b) Ratio of WSIs containing highly attended instances (attention weights $\geq$ 0.5). (c) Instance-bag label alignment ratios for positive slides, with instances grouped as top-1 attended, top-5 attended, and highly attended. An instance is labeled positive if over 20% of its area contains annotated positive regions. Negative slides are excluded because they contain only negative instances. (d) Distribution of maximum attention weights. ABMIL exhibits a highly skewed attention pattern with maximum weights up to 1, whereas SRMIL's weights are concentrated below 0.1, indicating a more uniform attention distribution.
  • Figure 4: Visualization of ABMIL attention for a negative bag.
  • Figure 5: Visualization of the effect of different mask ratios.
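
The quantities named in the Figure 3 caption can be computed with a few lines of NumPy. The sketch below is one plausible reading of the caption, not the authors' evaluation code: the function names are hypothetical, and the alignment ratio is assumed to be the fraction of selected instances in a positive slide that are themselves positive. Only the 0.5 attention threshold and the 20% area rule come from the caption.

```python
import numpy as np

def instance_labels(positive_area_fraction, min_fraction=0.2):
    """An instance is positive if more than 20% of its area is annotated positive."""
    return np.asarray(positive_area_fraction) > min_fraction

def highly_attended_ratio(attn_per_slide, threshold=0.5):
    """Fraction of slides whose maximum attention weight reaches the threshold."""
    flags = [np.max(attn) >= threshold for attn in attn_per_slide]
    return float(np.mean(flags))

def topk_alignment(attn, inst_pos, k):
    """Fraction of the k most attended instances that are positive (positive slides only)."""
    top = np.argsort(attn)[::-1][:k]
    return float(np.mean(np.asarray(inst_pos)[top]))

# Toy positive slide with four patches; attention weights sum to 1.
attn = np.array([0.6, 0.25, 0.1, 0.05])
labels = instance_labels([0.5, 0.0, 0.25, 0.1])  # patches 0 and 2 are positive
top1 = topk_alignment(attn, labels, k=1)
top2 = topk_alignment(attn, labels, k=2)
```

In this toy slide the single most attended patch is positive, while only one of the two most attended patches is, so the alignment ratio drops as k grows.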