ASMIL: Attention-Stabilized Multiple Instance Learning for Whole Slide Imaging

Linfeng Ye, Shayan Mohajer Hamidi, Zhixiang Chi, Guang Li, Mert Pilanci, Takahiro Ogawa, Miki Haseyama, Konstantinos N. Plataniotis

TL;DR

Attention-stabilized multiple instance learning (ASMIL) is a unified framework that uses an anchor model to stabilize attention, replaces softmax with a normalized sigmoid function in the anchor to prevent over-concentration, and applies token random dropping to mitigate overfitting.

Abstract

Attention-based multiple instance learning (MIL) has emerged as a powerful framework for whole slide image (WSI) diagnosis, leveraging attention to aggregate instance-level features into bag-level predictions. Despite this success, we find that such methods exhibit a new failure mode: unstable attention dynamics. Across four representative attention-based MIL methods and two public WSI datasets, we observe that attention distributions oscillate across epochs rather than converging to a consistent pattern, degrading performance. This instability adds to two previously reported challenges: overfitting and over-concentrated attention distribution. To simultaneously overcome these three limitations, we introduce attention-stabilized multiple instance learning (ASMIL), a novel unified framework. ASMIL uses an anchor model to stabilize attention, replaces softmax with a normalized sigmoid function in the anchor to prevent over-concentration, and applies token random dropping to mitigate overfitting. Extensive experiments demonstrate that ASMIL achieves up to a 6.49% F1 score improvement over state-of-the-art methods. Moreover, integrating the anchor model and normalized sigmoid into existing attention-based MIL methods consistently boosts their performance, with F1 score gains up to 10.73%. All code and data are publicly available at https://github.com/Linfeng-Ye/ASMIL.
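The three components named in the abstract can be illustrated with a minimal, framework-free sketch. It is not taken from the ASMIL repository: the normalized sigmoid is assumed to be a sigmoid followed by renormalization, the anchor update is assumed to be a plain exponential moving average, and the helper names (`normalized_sigmoid`, `ema_update`, `random_drop`) and the `momentum` value are hypothetical.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(scores):
    # Standard softmax with max-subtraction for stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_sigmoid(scores):
    # ASSUMED form of the normalized sigmoid: sigmoid each score,
    # then divide by the sum so the result is a probability vector.
    sig = [sigmoid(s) for s in scores]
    total = sum(sig)
    return [s / total for s in sig]


def ema_update(anchor, online, momentum=0.99):
    # ASSUMED anchor update: exponential moving average of online parameters.
    return [momentum * a + (1.0 - momentum) * o for a, o in zip(anchor, online)]


def random_drop(tokens, n_drop, rng):
    # Drop n_drop tokens uniformly at random, keeping the original order.
    keep = sorted(rng.sample(range(len(tokens)), len(tokens) - n_drop))
    return [tokens[i] for i in keep]


# Hypothetical attention scores with one dominant token.
scores = [8.0, 5.0, 4.0, -3.0]
p_soft = softmax(scores)
p_nsf = normalized_sigmoid(scores)
print("softmax:", [round(p, 4) for p in p_soft])
print("normalized sigmoid:", [round(p, 4) for p in p_nsf])
```

On these illustrative scores, softmax places most of the mass on the single largest score, while the normalized sigmoid spreads mass almost evenly over the three clearly positive scores and keeps the negative one small.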

Paper Structure (56 sections, 1 theorem, 40 equations, 26 figures, 19 tables)

This paper contains 56 sections, 1 theorem, 40 equations, 26 figures, 19 tables.

Key Result

Theorem 1

Fix $\tau>0$, $\gamma\ge 0$, and index sets $\mathcal{H},\mathcal{L}$ with $h\ge 1$, $\ell\ge 1$. For any $\boldsymbol{z}\in \mathcal{S}(\tau,\gamma,\mathcal{H},\mathcal{L})$: (A) NSF bounds. For any $i,h'\in\mathcal{H}$ and any $j\in\mathcal{L}$, the NSF equalizes the high tokens (indices $i,h'$) up to a factor $1+e^{-\tau}$ and suppresses each low token $j$ to at most $e^{-\tau}/h$. As $\tau\to\infty$ with fixed $\gamma$, ...
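The two NSF bounds stated above can be checked numerically. The sketch below rests on assumptions that this summary does not spell out: the NSF is taken to be $\sigma(z_i)/\sum_k \sigma(z_k)$, high scores are taken to satisfy $z_i\ge\tau$, low scores $z_j\le-\tau$, and $h$ is taken to be the number of high tokens. The role of $\gamma$ is not modeled, and `check_nsf_bounds` is a hypothetical helper name.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def nsf(scores):
    # ASSUMED normalized sigmoid: sigmoid, then renormalize to sum to 1.
    sig = [sigmoid(s) for s in scores]
    total = sum(sig)
    return [s / total for s in sig]


def check_nsf_bounds(high, low, tau, tol=1e-12):
    # Returns (ratio_ok, low_ok) for the two bounds quoted in Theorem 1(A).
    p = nsf(high + low)
    p_high, p_low = p[:len(high)], p[len(high):]
    # Largest ratio between any two high-token probabilities.
    ratio = max(p_high) / min(p_high)
    ratio_ok = ratio <= 1.0 + math.exp(-tau) + tol
    # Every low-token probability should be at most exp(-tau) / h.
    low_ok = max(p_low) <= math.exp(-tau) / len(high) + tol
    return ratio_ok, low_ok


rng = random.Random(0)
tau = 2.0
# ASSUMPTION: highs lie at or above tau, lows at or below -tau.
high = [tau + rng.uniform(0.0, 5.0) for _ in range(4)]
low = [-tau - rng.uniform(0.0, 5.0) for _ in range(6)]
ratio_ok, low_ok = check_nsf_bounds(high, low, tau)
print("ratio bound holds:", ratio_ok, "| low bound holds:", low_ok)
```

Under these assumptions both bounds follow from the monotonicity of the sigmoid: every high token has $\sigma(z_i)\in[1/(1+e^{-\tau}),1)$, every low token has $\sigma(z_j)\le e^{-\tau}/(1+e^{-\tau})$, and the normalizer is at least $h/(1+e^{-\tau})$.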

Figures (26)

  • Figure 1: Visualization of attention dynamics on a tumor WSI for TransMIL [shao2021transmil] vs. ASMIL (our method). The green contours indicate the annotated tumor regions. Top: TransMIL attention distribution at selected training iterations. Middle: Jensen-Shannon divergence (JSD) between attention distributions at successive steps and the cross-entropy (CE) loss, comparing TransMIL (blue) and ASMIL (red). Bottom: ASMIL attention distribution over different training iterations. Due to the weakly supervised nature of WSI subtyping datasets, TransMIL’s attention patterns never converge during training; furthermore, it focuses on only a subset of cancerous regions. In contrast, our method (i) produces stable attention distributions throughout training and (ii) consistently highlights cancerous regions.
  • Figure 2: Overview of ASMIL. Each WSI is divided into tiles and embedded into vision tokens using a pretrained encoder. These tokens, along with trainable FEAT tokens, feed into both the online and anchor encoders. The anchor encoder’s attention scores over the FEAT tokens are transformed into a probability vector using an NSF, while the online encoder applies a softmax. To stabilize training and prevent the online model’s attention from becoming overly concentrated, we compute the KL divergence between the two distributions. Gradients to the anchor encoder are blocked using a stop-gradient (sg) operator, and its parameters are updated via EMA from the online encoder. During training, we randomly drop (rd) $N$ FEAT tokens, feed the remaining tokens into a second transformer with a trainable [CLS] token, and train a classifier on its output. Icons in the figure indicate learnable and frozen components, respectively.
  • Figure 3: (a) Distribution of attention scores in ABMIL, which exhibits a long-tailed pattern. (b) Attention distribution obtained with the softmax function and (c) with the NSF. Unlike softmax, the normalized sigmoid suppresses large values in the long tail, yielding a less sparse and more interpretable attention distribution.
  • Figure 4: Visual comparison of attention maps on the CAMELYON-16 dataset. The left column shows the original WSI with ground-truth tumor annotations outlined in red; the remaining columns present attention maps for ASMIL (ours), DTFD-MIL, CAMIL, and TransMIL (left to right).
  • Figure 5: t-SNE embeddings of ASMIL bag-level features on the BRACS training set across training epochs. Top: with the anchor model; Bottom: without the anchor model.
  • ...and 21 more figures
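The stability measure plotted in the middle panel of Figure 1, the Jensen-Shannon divergence between attention distributions at successive training steps, can be sketched as follows. The attention vectors are made-up illustrative values, not data from the paper, and `kl_divergence` and `js_divergence` are hypothetical helper names. The same KL helper could serve for the anchor/online alignment term described in Figure 2, though the caption does not state the direction of that KL term.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) in nats; eps guards against log(0) and division by zero.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetric, bounded above by ln(2).
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Illustrative attention over four tiles at training step t.
att_t = [0.70, 0.20, 0.05, 0.05]
# Oscillating model: attention flips to other tiles at step t+1.
att_next_oscillating = [0.05, 0.05, 0.20, 0.70]
# Stable model: attention barely moves at step t+1.
att_next_stable = [0.68, 0.22, 0.05, 0.05]

jsd_oscillating = js_divergence(att_t, att_next_oscillating)
jsd_stable = js_divergence(att_t, att_next_stable)
print("JSD (oscillating):", round(jsd_oscillating, 4))
print("JSD (stable):", round(jsd_stable, 4))
```

A JSD near zero between successive steps indicates that the attention map has settled, whereas values approaching $\ln 2$ indicate that attention mass has moved to almost entirely different tiles.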

Theorems & Definitions (6)

  • Remark 1
  • Theorem 1: NSF achieves selective flattening; softmax cannot with a single $T$
  • Remark 2
  • proof
  • Remark 3: Middle scores
  • proof