From Correlation to Causation: Max-Pooling-Based Multi-Instance Learning Leads to More Robust Whole Slide Image Classification

Xin Liu; Weijia Zhang; Wei Tang; Thuc Duy Le; Jiuyong Li; Lin Liu; Min-Ling Zhang

TL;DR

The paper analyzes attention-based MIL's vulnerability to spurious correlations in whole-slide image analysis and reframes max-pooling MIL through a causal lens. It demonstrates that, under mild assumptions, max-pooling can isolate causal content factors while ignoring environmental biases, but existing max-pooling models suffer from rote memorization and instability with hard instances. To address this, it introduces FocusMIL, which couples variational information bottleneck regularization with a multi-slide mini-batch training strategy to stabilize training and suppress memorization. Empirical results on real and semi-synthetic datasets show that FocusMIL achieves superior out-of-distribution generalization and more accurate instance-level tumor localization, validating the proposed causal perspective and practical robustness.

Abstract

In whole slide image (WSI) analysis, attention-based multi-instance learning (MIL) models are susceptible to spurious correlations and degrade under domain shift. These methods may assign high attention weights to non-tumor regions, such as staining biases or artifacts, leading to unreliable tumor region localization. In this paper, we revisit max-pooling-based MIL methods from a causal perspective. Under mild assumptions, our theoretical results demonstrate that max-pooling encourages the model to focus on causal factors while ignoring bias-related factors. Furthermore, we discover that existing max-pooling-based methods may overfit the training set through rote memorization of instance features and fail to learn meaningful patterns. To address these issues, we propose FocusMIL, which couples max-pooling with an instance-level variational information bottleneck (VIB) to learn compact, predictive latent representations, and employs a multi-bag mini-batch scheme to stabilize optimization. We conduct comprehensive experiments on three real-world datasets and one semi-synthetic dataset. The results show that, by capturing causal factors, FocusMIL exhibits significant advantages in out-of-distribution scenarios and instance-level tumor region localization tasks.
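The three ingredients named in the abstract (an instance-level stochastic encoder, max-pooling over instance scores, and a loss averaged over several bags) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the scalar latent, the linear encoder and classifier, the standard normal prior, the per-bag averaging of the KL term, the weight `beta`, and every function name below are hypothetical choices made for the example.

```python
import math
import random

def encode(x, w_mu, w_logvar):
    """Hypothetical linear stochastic encoder: mean and log-variance of a scalar latent."""
    mu = sum(a * b for a, b in zip(w_mu, x))
    logvar = sum(a * b for a, b in zip(w_logvar, x))
    return mu, logvar

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ) for a scalar latent (assumed prior)."""
    return 0.5 * (math.exp(logvar) + mu * mu - 1.0 - logvar)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def bag_loss(bag, label, params, beta, rng):
    """Binary cross-entropy on the max-pooled instance score plus beta * mean KL."""
    w_mu, w_logvar, w_clf, b_clf = params
    scores, kls = [], []
    for x in bag:
        mu, logvar = encode(x, w_mu, w_logvar)
        # Reparameterization: sample z = mu + sigma * eps with eps ~ N(0, 1).
        z = mu + math.exp(0.5 * logvar) * rng.gauss(0.0, 1.0)
        scores.append(sigmoid(w_clf * z + b_clf))  # instance-level prediction
        kls.append(kl_to_standard_normal(mu, logvar))
    # Max-pooling: the bag prediction is the highest instance score.
    p_bag = min(max(max(scores), 1e-7), 1.0 - 1e-7)
    bce = -(label * math.log(p_bag) + (1 - label) * math.log(1.0 - p_bag))
    return bce + beta * sum(kls) / len(kls)

def minibatch_loss(bags, labels, params, beta=0.01, seed=0):
    """Average the per-bag loss over several bags (slides) in one mini-batch."""
    rng = random.Random(seed)
    losses = [bag_loss(b, y, params, beta, rng) for b, y in zip(bags, labels)]
    return sum(losses) / len(losses)
```

Calling `minibatch_loss` with two or more bags gives one scalar objective per mini-batch, so a single update would reflect several slides at once rather than one.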

Paper Structure (49 sections, 10 theorems, 35 equations, 6 figures, 10 tables)

Introduction
Related Work
Attention-based MIL for WSI classification
Reproducibility and Learnability in Multiple Instance Learning
Revisiting Max-Pooling-Based MIL: A Causal Perspective
Multi-Instance Learning (MIL)
A Causal Perspective on Max-Pooling MIL
Formal Framework and Assumptions
Discussion of Assumptions.
Theoretical Analysis
Why Existing Max-Pooling-Based Models Underperform?
FocusMIL
Variational Information Bottleneck Regularization.
Multi-Slide Mini-Batch Gradient Descent for WSI Classification.
Experiments
...and 34 more sections

Key Result

Lemma 1

During a gradient update step, the score of the argmax instance in each bag moves in the direction consistent with the bag label.
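A toy numerical check can make the statement concrete. The sketch below is not the paper's formulation or proof: it assumes a hypothetical linear instance scorer, a sigmoid on the max-pooled score, and a binary cross-entropy loss, and the function name `argmax_score_change` and the learning rate are illustrative. Under these assumptions, one gradient step changes the argmax instance's score by `-lr * (p - label) * ||x||^2`, which is positive for a positive bag and negative for a negative bag.

```python
import math

def argmax_score_change(bag, label, w, lr=0.1):
    """Change in the argmax instance's score after one gradient step.

    Assumes a linear scorer s_i = w . x_i and a BCE loss on sigmoid(max_i s_i).
    """
    scores = [sum(a * b for a, b in zip(w, x)) for x in bag]
    k = max(range(len(bag)), key=scores.__getitem__)  # index of the argmax instance
    p = 1.0 / (1.0 + math.exp(-scores[k]))            # bag-level probability
    # Gradient of BCE w.r.t. w flows only through the argmax instance: (p - y) * x_k.
    grad = [(p - label) * xj for xj in bag[k]]
    w_new = [wj - lr * gj for wj, gj in zip(w, grad)]
    new_score = sum(a * b for a, b in zip(w_new, bag[k]))
    return new_score - scores[k]
```

With the same bag, a label of 1 yields a positive change and a label of 0 yields a negative one, matching the direction described in the lemma for this simplified setting.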

Figures (6)

Figure 1: (a) A causal graph illustrating the generative procedure of instances and their labels. Ideally, $\bm{z}_c$ is used to predict $y$. (b) The architecture of attention-based MIL (excluding the bag feature for brevity), where the classifier can leverage any of $\bm{z}_c$ and/or $\bm{z}_e$ for prediction.
Figure 2: Overview of FocusMIL, illustrated with 2 slides in a mini-batch. Features are extracted from patches using a pre-trained feature extractor. The instance features are encoded into latent representations via a stochastic encoder. The classifier $P(Y|\bm{z})$ uses the inferred latent representations $\bm{z}$ to obtain patch-level prediction scores. Max-pooling is applied to the instance predictions of each MIL bag to derive the slide-level prediction.
Figure 3: Visualization of FocusMIL and DTFD-MIL on the Camelyon16 dataset. Ours(3) and Ours(1) refer to FocusMIL models trained with batch sizes of 3 and 1, respectively. The fourth slide is negative.
Figure 4: Training and validation metrics over epochs for the mi-Net model on the Camelyon16 dataset.
Figure 5: Visualization results of FocusMIL and DTFD-MIL on 3 normal slides. In the zoomed-in areas, some regions are relatively clear, while the rest are very blurred. DTFD-MIL may assign a high positive probability to the blurred regions.
...and 1 more figure

Theorems & Definitions (17)

Lemma 1: Optimization Direction
Lemma 2: Gradient Locality
Lemma 3: Unlearnability of Negative Concepts
Theorem 1: Environmental Robustness
Theorem 2: Focusing on Causal Factors
Lemma 4: Optimization Direction
proof
Lemma 5: Gradient Locality
proof
Definition 1: Non-Causal Factor as Negative Label Signal
...and 7 more
