Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

Wenhao Tang, Rong Qin, Heng Fang, Fengtao Zhou, Hao Chen, Xiang Li, Ming-Ming Cheng

TL;DR

This paper revisits end-to-end slide-level learning for computational pathology by diagnosing optimization challenges caused by sparse-attention MIL and proposing ABMILX, which combines multi-head local attention with a global "attention plus" module and a multi-scale sampling pipeline. ABMILX mitigates optimization risks, enabling effective encoder fine-tuning within an end-to-end framework and delivering performance on par with foundation-model–driven two-stage approaches at substantially lower computational cost. Across diverse tasks (grading, subtyping, survival) and external validation, the method demonstrates strong generalization and efficiency, challenging the notion that large pretraining is essential for SOTA CPath performance. The results advocate for greater investment in E2E learning for WSIs and MIL, with ABMILX providing a scalable, task-adaptive path forward.

Abstract

Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we are the first to elucidate the optimization challenge caused by sparse-attention MIL, and we propose a novel MIL called ABMILX. It mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With an efficient multi-scale random patch sampling strategy, an E2E-trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient (<10 RTX3090 hours). We show the potential of E2E learning in CPath and call for greater research focus in this area. The code is available at https://github.com/DearCaat/E2E-WSI-ABMILX.

Paper Structure

This paper contains 30 sections, 30 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: (a,b) We compare an E2E-trained ResNet with various foundation models under the two-stage paradigm in terms of performance, model size, and pretraining data. This demonstrates the performance potential of E2E learning for computational pathology under a low computational budget. (c) Compared to sampling strategies, the choice of MIL has a more significant impact on E2E learning, at lower cost.
  • Figure 2: In E2E learning, MIL can be viewed as a soft instance selector that is optimized iteratively with the encoder. The encoder outputs instance features to MIL for attention-based aggregation and receives the instance gradients for optimization. The attention from MIL affects the gradients of different instance features, leading to selective learning of patches by the encoder. In contrast to two-stage learning approaches, the excessively sparse attention in common use causes encoder optimization to overfit to a few discriminative regions and leaves it vulnerable to redundant ones. Degraded features in turn reduce selection accuracy, compromising the optimization loop.
  • Figure 3: Overview of the proposed E2E training pipeline and ABMILX. ABMILX introduces multi-head local attention to address the extreme sparsity issue in ABMIL [ilse2018attention], which hinders E2E optimization. Furthermore, ABMILX refines the local attention using global feature correlations via the attention plus. This encourages the model to focus on task-specific regions during E2E learning.
  • Figure 4: Attention visualization on the PANDA dataset [panda]. All slides are from the Karolinska Center, with annotations limited to three types: background, benign tissue, and cancerous tissue. We highlight cancerous tissue in blue and display high-attention patches as bright patches for comparison.
  • Figure 5: Heatmap visualization on the PANDA dataset [panda]. The top row shows the original slide and its annotation (with cancerous tissue in red). The middle and bottom rows present attention maps generated by ResNet & ABMILX (E2E) and UNI & ABMIL (Offline), respectively. Color intensity ranges from blue (low attention) to red (high attention), illustrating how each approach prioritizes different tissue regions. Notably, our model yields a more uniform attention distribution while effectively highlighting cancerous areas.
  • ...and 4 more figures
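
The mechanism described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes plain single-head softmax attention pooling in the style of ABMIL, not ABMILX itself, and it tracks only the gradient that reaches each instance feature through the weighted sum (the additional path through the attention scores is ignored). All function names, the toy bag, and the `scale` factor used to sharpen the attention are illustrative assumptions, not taken from the paper or its repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Bag feature z = sum_i a_i * h_i with a = softmax(scores)."""
    weights = softmax(scores)
    dim = len(features[0])
    bag = [sum(w * h[d] for w, h in zip(weights, features)) for d in range(dim)]
    return bag, weights


def direct_instance_grads(weights, grad_bag):
    """Gradient reaching each h_i through the weighted sum only: a_i * dL/dz.

    Simplifying assumption: the path through the attention scores is omitted.
    """
    return [[w * g for g in grad_bag] for w in weights]


def effective_instances(weights):
    """Inverse participation ratio 1 / sum_i a_i^2.

    Equals N for uniform attention over N instances and 1 for one-hot attention.
    """
    return 1.0 / sum(w * w for w in weights)


# Toy bag of four 2-D instance features (hypothetical values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
raw_scores = [2.0, 1.0, 0.5, 0.0]   # hypothetical attention logits
grad_bag = [1.0, -1.0]              # hypothetical upstream gradient dL/dz

for scale in (1.0, 5.0):            # a larger scale gives sparser attention
    scores = [scale * s for s in raw_scores]
    bag, weights = attention_pool(features, scores)
    grads = direct_instance_grads(weights, grad_bag)
    print(
        "scale", scale,
        "weights", [round(w, 3) for w in weights],
        "effective instances", round(effective_instances(weights), 2),
    )
```

Because each instance receives the bag gradient scaled by its attention weight, sharper attention sends almost all of the direct gradient to a single patch. This is the kind of selective encoder update the caption describes, shown here only for the simplified single-head case.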