MsaMIL-Net: An End-to-End Multi-Scale Aware Multiple Instance Learning Network for Efficient Whole Slide Image Classification

Jiangping Wen, Jinyu Wen, Meie Fang

TL;DR

MsaMIL-Net tackles the bottleneck of end-to-end WSI classification by integrating semantic lesion filtering, multi-scale feature extraction, and cross-scale instance-aware fusion within a differentiable MIL framework. It enables joint optimization of feature extractors and MIL components across three native scales ($20\times$, $10\times$, $5\times$) using an end-to-end training strategy, improving ACC and AUC on DigestPath2019, BCNB, and UBC-OCEAN datasets. The key contributions are the SFFM for non-lesion interference reduction, the MSFEM for cross-scale semantic alignment, and the IAAM with MHE and DMQ for robust cross-scale aggregation, achieving state-of-the-art performance while reducing computational load via targeted lesion-area processing. Overall, the framework demonstrates that end-to-end, multi-scale MIL with efficient filtering and attention-based fusion can substantially enhance WSI classification in biomedical imaging with practical efficiency gains.

Abstract

Bag-based Multiple Instance Learning (MIL) approaches have become the mainstream methodology for Whole Slide Image (WSI) classification. However, most existing methods adopt a segmented training strategy: features are first extracted with a pre-trained feature extractor and then aggregated through MIL. Because the two stages are trained separately, the feature extraction network and the MIL network cannot be jointly optimized end to end, which limits the overall performance of the model. Additionally, conventional methods typically extract features from all patches at a single fixed size, ignoring the multi-scale way in which pathologists examine slides. This wastes significant computational resources when tumor regions make up only a small proportion of the slide (as in the Camelyon16 dataset) and may also lead the model to suboptimal solutions. To address these limitations, this paper proposes an end-to-end multi-scale WSI classification framework that integrates multi-scale feature extraction with multiple instance learning. Specifically, our approach includes: (1) a semantic feature filtering module to reduce interference from non-lesion areas; (2) a multi-scale feature extraction module to capture pathological information at different levels; and (3) a multi-scale fusion MIL module for global modeling and feature integration. Through an end-to-end training strategy, we simultaneously optimize both the feature extractor and the MIL network, ensuring maximum compatibility between them. Experiments were conducted on three cross-center datasets (DigestPath2019, BCNB, and UBC-OCEAN). Results demonstrate that the proposed method outperforms existing state-of-the-art approaches in both accuracy (ACC) and AUC.
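
The three steps listed in the abstract (filter non-lesion patches, extract features at several scales, aggregate with MIL) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the lesion scores, the 0.5 threshold, the feature dimension, and the function names `filter_patches` and `attention_pool` are all assumptions made for illustration, and plain attention-based MIL pooling stands in for the paper's fusion module. Random vectors replace real patch features, and nothing is trained here, whereas the paper optimizes the extractor and the MIL network jointly.

```python
import numpy as np

def filter_patches(features, lesion_scores, threshold=0.5):
    """Keep patches whose (hypothetical) lesion score reaches the threshold."""
    return features[lesion_scores >= threshold]

def attention_pool(instances, v, w):
    """Attention-based MIL pooling: softmax-weighted sum of instance features."""
    scores = np.tanh(instances @ v) @ w          # one scalar score per instance
    scores = scores - scores.max()               # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ instances, weights

rng = np.random.default_rng(0)
dim, hidden = 8, 4                               # assumed toy sizes
v = rng.normal(size=(dim, hidden))               # attention projection (untrained)
w = rng.normal(size=hidden)                      # attention scoring vector (untrained)

# Toy patch features per magnification; a real pipeline would use a CNN or ViT.
patch_counts = {"20x": 16, "10x": 8, "5x": 4}
bags = {s: rng.normal(size=(n, dim)) for s, n in patch_counts.items()}
# Stand-in lesion scores; a segmentation network would supply these in practice.
lesion = {s: np.linspace(0.0, 1.0, n) for s, n in patch_counts.items()}

# Filter each scale, then pool all surviving instances into one bag embedding.
kept = [filter_patches(bags[s], lesion[s]) for s in patch_counts]
instances = np.concatenate(kept, axis=0)
bag_embedding, weights = attention_pool(instances, v, w)
print(instances.shape, bag_embedding.shape)
```

Filtering before pooling is what reduces the computational load in this sketch: only the surviving patches reach the attention step, and a slide-level classifier would then operate on `bag_embedding`.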

Paper Structure

This paper contains 20 sections, 9 equations, 3 figures, 3 tables, 2 algorithms.

Figures (3)

  • Figure 1: This paper introduces an end-to-end multi-scale aware multiple instance learning network (MsaMIL-Net), which comprises three main modules: the semantic feature filtering module (SFFM), the multi-scale feature extraction module (MSFEM), and the instance-aware attention module (IAAM). SFFM reduces the interference of non-lesion areas in WSI classification while improving inference speed and efficiency. MSFEM enables cross-scale semantic alignment, from microscopic cellular morphology to macroscopic tissue architecture. IAAM strengthens the model's ability to capture instance-level details and supports deep interaction across multi-scale information.
  • Figure 2: Segmentation inference results
  • Figure 3: (a): Changes in IoU during the training process of U-Net++; (b): Comparison between self-supervised learning with DINO and end-to-end training methods.