ASMIL: Attention-Stabilized Multiple Instance Learning for Whole Slide Imaging

Linfeng Ye; Shayan Mohajer Hamidi; Zhixiang Chi; Guang Li; Mert Pilanci; Takahiro Ogawa; Miki Haseyama; Konstantinos N. Plataniotis

TL;DR

ASMIL (attention-stabilized multiple instance learning) is a unified framework that uses an anchor model to stabilize attention, replaces softmax with a normalized sigmoid function (NSF) in the anchor to prevent over-concentration, and applies token random dropping to mitigate overfitting.

Abstract

Attention-based multiple instance learning (MIL) has emerged as a powerful framework for whole slide image (WSI) diagnosis, leveraging attention to aggregate instance-level features into bag-level predictions. Despite this success, we find that such methods exhibit a new failure mode: unstable attention dynamics. Across four representative attention-based MIL methods and two public WSI datasets, we observe that attention distributions oscillate across epochs rather than converging to a consistent pattern, degrading performance. This instability adds to two previously reported challenges: overfitting and over-concentrated attention distributions. To simultaneously overcome these three limitations, we introduce attention-stabilized multiple instance learning (ASMIL), a novel unified framework. ASMIL uses an anchor model to stabilize attention, replaces softmax with a normalized sigmoid function in the anchor to prevent over-concentration, and applies token random dropping to mitigate overfitting. Extensive experiments demonstrate that ASMIL achieves up to a 6.49% F1 score improvement over state-of-the-art methods. Moreover, integrating the anchor model and normalized sigmoid into existing attention-based MIL methods consistently boosts their performance, with F1 score gains up to 10.73%. All code and data are publicly available at https://github.com/Linfeng-Ye/ASMIL.
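To make the contrast between softmax and the normalized sigmoid concrete, the following is a minimal sketch in plain Python. It assumes the normalized sigmoid takes the form "apply a sigmoid to each score, then divide by the sum", which is one natural reading of the name; the abstract does not spell out the formula, and the scores below are made up for illustration.

```python
import math

def softmax(scores):
    """Standard softmax, shifted by the max score for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def normalized_sigmoid(scores):
    """ASSUMED form of the NSF: element-wise sigmoid, renormalized to sum to 1."""
    sig = [sigmoid(s) for s in scores]
    total = sum(sig)
    return [v / total for v in sig]

# Hypothetical attention scores: three "high" instances and two "low" ones.
scores = [8.0, 6.0, 5.0, -4.0, -5.0]
p_soft = softmax(scores)
p_nsf = normalized_sigmoid(scores)

print("softmax:", [round(p, 4) for p in p_soft])
print("NSF    :", [round(p, 4) for p in p_nsf])
```

Under this assumed form, softmax places most of its mass on the single largest score, while the normalized sigmoid spreads mass almost evenly over the three high scores and still keeps the low scores near zero.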

Paper Structure (56 sections, 1 theorem, 40 equations, 26 figures, 19 tables)

Introduction
Related Work
Preliminaries and Motivation
Notation
Multiple Instance Learning with Attention
Motivation
Methodology
Stabilizing Attention Distributions via an Anchor Model
Preventing Attention Concentration with NSF in the Anchor Model
Mitigating Overfitting with Token Random Dropping
Overall Training Objective
Experiments
Subtyping Performance
Integrating the Anchor Model and NSF with Other MIL Methods
Localization
...and 41 more sections

Key Result

Theorem 1

Fix $\tau>0$, $\gamma\ge 0$, and index sets $\mathcal{H},\mathcal{L}$ with $h\ge 1$, $\ell\ge 1$. For any $\boldsymbol{z}\in \mathcal{S}(\tau,\gamma,\mathcal{H},\mathcal{L})$: (A) NSF bounds. For any $i,h'\in\mathcal{H}$ and any $j\in\mathcal{L}$, NSF equalizes the high tokens up to a factor $1+e^{-\tau}$ and suppresses lows to at most $e^{-\tau}/h$. As $\tau\to\infty$ with fixed $\gamma$,
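The two constants in part (A) can be checked numerically. The sketch below is only an illustration under stated assumptions: the excerpt does not define $\mathcal{S}(\tau,\gamma,\mathcal{H},\mathcal{L})$, so it assumes that high scores satisfy $z_i\ge\tau$ and low scores satisfy $z_j\le-\tau$, and that NSF is an element-wise sigmoid followed by normalization. The role of $\gamma$ is not modeled, and all numeric values are hypothetical.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def normalized_sigmoid(scores):
    """ASSUMED form of the NSF: element-wise sigmoid, renormalized to sum to 1."""
    sig = [sigmoid(s) for s in scores]
    total = sum(sig)
    return [v / total for v in sig]

tau = 3.0          # hypothetical margin
h, ell = 4, 50     # hypothetical sizes of the high and low index sets
rng = random.Random(0)

# ASSUMPTION: high scores lie at or above tau, low scores at or below -tau.
high = [tau + rng.uniform(0.0, 5.0) for _ in range(h)]
low = [-tau - rng.uniform(0.0, 5.0) for _ in range(ell)]

p = normalized_sigmoid(high + low)
p_high, p_low = p[:h], p[h:]

ratio = max(p_high) / min(p_high)      # spread among the high tokens
ratio_bound = 1.0 + math.exp(-tau)     # factor stated in the theorem
low_bound = math.exp(-tau) / h         # cap on any low token stated in the theorem

print(f"high-token ratio {ratio:.5f} <= {ratio_bound:.5f}")
print(f"largest low prob {max(p_low):.6f} <= {low_bound:.6f}")
```

Under these assumptions both inequalities follow directly: every high sigmoid value lies in $[1/(1+e^{-\tau}),1)$, every low value is at most $e^{-\tau}/(1+e^{-\tau})$, and the normalizer is at least $h/(1+e^{-\tau})$.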

Figures (26)

Figure 1: Visualization of attention dynamics on a tumor WSI for TransMIL (Shao et al., 2021) vs. ASMIL (our method). The green contours indicate the annotated tumor regions. Top: TransMIL attention distribution at selected training iterations. Middle: Jensen-Shannon divergence (JSD) between attention distributions at successive steps and the cross-entropy (CE) loss, comparing TransMIL (blue) and ASMIL (red). Bottom: ASMIL attention distribution at different training iterations. Due to the weakly supervised nature of WSI subtyping datasets, TransMIL's attention patterns never converge during training; moreover, they cover only a subset of the cancerous regions. In contrast, our method (i) produces stable attention distributions throughout training and (ii) consistently highlights cancerous regions.
Figure 2: Overview of ASMIL. Each WSI is divided into tiles and embedded into vision tokens using a pretrained encoder. These tokens, along with trainable FEAT tokens, feed into both the online and anchor encoders. The anchor encoder's attention scores over the FEAT tokens are transformed into a probability vector using an NSF, while the online encoder applies a softmax. To stabilize training and prevent the online model's attention from becoming overly concentrated, we compute the KL divergence between the two distributions. Gradients to the anchor encoder are blocked using a stop-gradient (sg) operator, and its parameters are updated via an exponential moving average (EMA) of the online encoder's parameters. During training, we randomly drop (rd) $N$ FEAT tokens, feed the remaining tokens into a second transformer with a trainable [CLS] token, and train a classifier on its output. The two icons in the figure indicate learnable and frozen components, respectively.
Figure 3: (a) Distribution of attention scores in ABMIL, which exhibits a long-tailed pattern. (b) Attention distribution obtained with the softmax function and (c) with the NSF. Unlike softmax, the normalized sigmoid suppresses large values in the long tail, yielding a less sparse and more interpretable attention distribution.
Figure 4: Visual comparison of attention maps on the CAMELYON-16 dataset. The left column shows the original WSI with ground-truth tumor annotations outlined in red; the remaining columns present attention maps for ASMIL (ours), DTFD-MIL, CAMIL, and TransMIL (left to right).
Figure 5: t-SNE embeddings of ASMIL bag-level features on the BRACS training set across training epochs. Top: with the anchor model; Bottom: without the anchor model.
...and 21 more figures
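
The training-time mechanics named in the captions of Figures 1 and 2 (JSD between successive attention distributions, the KL term between anchor and online attention, the EMA update of the anchor, and random dropping of FEAT tokens) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: parameters are flat lists rather than tensors, the momentum value and helper names are hypothetical, and the KL direction (anchor as the target) is an assumption because the caption does not specify it.

```python
import math
import random

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def ema_update(anchor_params, online_params, momentum=0.99):
    """EMA update of the anchor parameters; no gradient reaches the anchor."""
    return [momentum * a + (1.0 - momentum) * o
            for a, o in zip(anchor_params, online_params)]

def random_drop(tokens, n_drop, rng):
    """Drop n_drop tokens uniformly at random, preserving the original order."""
    keep = sorted(rng.sample(range(len(tokens)), len(tokens) - n_drop))
    return [tokens[i] for i in keep]

rng = random.Random(0)

# Anchor parameters move slowly toward the online parameters.
anchor = ema_update([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], momentum=0.9)

# Hypothetical attention distributions over three tokens.
attn_anchor = [0.4, 0.35, 0.25]   # e.g. produced by NSF in the anchor
attn_online = [0.8, 0.15, 0.05]   # e.g. produced by softmax in the online model
stabilization_loss = kl_divergence(attn_anchor, attn_online)  # ASSUMED direction

# JSD between attention at two successive steps, as monitored in Figure 1.
jsd = js_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])

# Token random dropping on six placeholder FEAT tokens.
kept = random_drop(list("abcdef"), 2, rng)

print(anchor, round(stabilization_loss, 4), round(jsd, 4), kept)
```

A low JSD between successive steps corresponds to the stable attention behaviour the Figure 1 caption attributes to ASMIL; in a real training loop these quantities would be computed on model tensors inside an autodiff framework.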

Theorems & Definitions (6)

Remark 1
Theorem 1: NSF achieves selective flattening; softmax cannot with a single $T$
Remark 2
proof
Remark 3: Middle scores
proof
