SAM-MIL: A Spatial Contextual Aware Multiple Instance Learning Approach for Whole Slide Image Classification

Heng Fang; Sheng Huang; Wenhao Tang; Luwen Huangfu; Bo Liu

TL;DR

This paper tackles a limitation of traditional MIL for whole-slide image classification: it largely ignores the global spatial context among patches. It introduces SAM-MIL, a spatial-context-aware MIL framework that uses the Segment Anything Model to extract region-level context and integrates it through SAM-guided group masking, region-based group features, and a consistency loss applied to pseudo-bags. Empirical results on CAMELYON-16 and TCGA Lung Cancer show state-of-the-art AUROC improvements over strong MIL baselines, supporting the value of explicit spatial context in WSI analysis. The approach also provides a plug-in SAM-based feature extractor and offers insight into how spatial context can guide attention and aggregation in MIL, with practical implications for pathology image analysis.

Abstract

Multiple Instance Learning (MIL) represents the predominant framework in Whole Slide Image (WSI) classification, covering aspects such as sub-typing, diagnosis, and beyond. Current MIL models predominantly rely on instance-level features derived from pretrained models such as ResNet. These models segment each WSI into independent patches and extract features from these local patches, leading to a significant loss of global spatial context and restricting the model's focus to merely local features. To address this issue, we propose a novel MIL framework, named SAM-MIL, that emphasizes spatial contextual awareness and explicitly incorporates spatial context by extracting comprehensive, image-level information. The Segment Anything Model (SAM) represents a pioneering visual segmentation foundation model that can capture segmentation features without the need for additional fine-tuning, rendering it an outstanding tool for extracting spatial context directly from raw WSIs. Our approach includes the design of group feature extraction based on spatial context and a SAM-Guided Group Masking strategy to mitigate class imbalance issues. We implement a dynamic mask ratio for different segmentation categories and supplement these with representative group features of categories. Moreover, SAM-MIL divides instances to generate additional pseudo-bags, thereby augmenting the training set, and introduces consistency of spatial context across pseudo-bags to further enhance the model's performance. Experimental results on the CAMELYON-16 and TCGA Lung Cancer datasets demonstrate that our proposed SAM-MIL model outperforms existing mainstream methods in WSI classification. Our open-source implementation code is available at https://github.com/FangHeng/SAM-MIL.
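To make the idea of a per-category dynamic mask ratio more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each patch already carries a SAM segment label (`segment_ids`, a hypothetical input) and uses an assumed rule in which a group's mask ratio scales with its size relative to the largest group, so dominant segmentation categories are masked more heavily than rare ones. The function name, the `base_ratio` parameter, and the ratio rule are illustrative assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict


def sam_guided_group_mask(segment_ids, base_ratio=0.5, seed=0):
    """Return sorted indices of instances kept after group-wise masking.

    segment_ids[i] is the (hypothetical) SAM segment label of patch i.
    Assumed rule: ratio_g = base_ratio * size_g / max_size, so the largest
    group is masked at base_ratio and small groups are barely masked.
    """
    rng = random.Random(seed)

    # Group instance indices by their segment label.
    groups = defaultdict(list)
    for idx, seg in enumerate(segment_ids):
        groups[seg].append(idx)

    max_size = max(len(members) for members in groups.values())
    kept = []
    for members in groups.values():
        ratio = base_ratio * len(members) / max_size
        n_mask = int(ratio * len(members))
        # Always keep at least one instance per group.
        n_mask = min(n_mask, len(members) - 1)
        masked = set(rng.sample(members, n_mask))
        kept.extend(i for i in members if i not in masked)
    return sorted(kept)


# A bag with one dominant segment (8 patches) and one rare segment (2 patches).
segment_ids = [0] * 8 + [1] * 2
kept = sam_guided_group_mask(segment_ids, base_ratio=0.5)
```

In this toy bag, half of the dominant segment is masked while the rare segment is left intact, which is the class-balancing effect the abstract attributes to group masking. The function expects a non-empty `segment_ids` list.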

Paper Structure (32 sections, 13 equations, 11 figures, 7 tables)

Introduction
Related Work
Multiple Instance Learning in WSI Analysis
SAM in Medical Image Analysis
Proposed Method
Preliminary
Spatial Contextual Aware WSI Classification
SAM-Guided Group Masking Strategy
Pseudo-Bag & Consistency Loss
SAM-based MIL
Experiments and Results
WSI Preprocessing
Datasets and Evaluation Metrics
Performance Comparison
Ablation Study
...and 17 more sections

Figures (11)

Figure 1: Top: The conventional pathologists' assessment strongly relies on the spatial contextual features in the WSI. Middle: The conventional MIL paradigm relies solely on individual features, overlooking the global spatial context between patches. Bottom: The proposed MIL paradigm introduces the SAM, which utilizes the spatial context and guides the optimization of the MIL model.
Figure 2: Overview of the proposed SAM-Guided WSI classification model. In the Feature Extractor stage, WSI slides are segmented into corresponding tissues. Following the patching operation, each tissue sequentially extracts features from each patch. Simultaneously, SAM performs segmentation on the entire slide, extracting representative features from each region as group features based on spatial context. In the Feature Aggregation stage, the spatial context of SAM is utilized at two levels. At the Instance level, instance grouping masks are applied under the guidance of SAM, while at the Bag level, pseudo-bag training loss and consistency loss are calculated under SAM's guidance to constrain the model's training. This approach ensures that both detailed instance-level features and the broader bag-level insights contribute to the model's learning process.
Figure 3: Illustration of proposed masking strategy. We propose three masking strategies for instances. The first two strategies involve randomized masks. The third strategy is our proposed spatial context-based SAM-Guided Group Masking (SG$^2$M), which groups various SAM segmentation categories and enforces a dynamic mask ratio within each group.
Figure 4: Illustration of Pseudo-Bag Loss & Consistency Loss.
Figure 5: The figure illustrates the comparison between the slides and their corresponding SAM segmentation results. The first row displays samples of the original slides, including both tumor and normal slides, arranged in descending order by resolution. In the tumor slides, blue lines outline the tumor regions. The arrangement by resolution emphasizes SAM's segmentation performance from macroscopic disease areas to microscopic detail. It is evident that SAM can accurately delineate diseased areas at different scales, and effectively segment normal slides based on visual information.
...and 6 more figures
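
The pseudo-bag and consistency ideas named in the abstract and in the Figure 4 caption can be sketched as follows. This is an illustrative assumption, not the paper's formulation: instances are dealt round-robin within each SAM segment so that every pseudo-bag keeps a similar spatial-context composition, and the consistency term is taken to be the mean squared deviation of each pseudo-bag's class probabilities from their average. The names `split_pseudo_bags` and `consistency_loss`, and the specific loss form, are hypothetical.

```python
import random
from collections import defaultdict


def split_pseudo_bags(segment_ids, num_bags=2, seed=0):
    """Split instance indices into pseudo-bags, stratified by segment label."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, seg in enumerate(segment_ids):
        groups[seg].append(idx)

    bags = [[] for _ in range(num_bags)]
    for members in groups.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        # Deal each segment's instances round-robin across the pseudo-bags.
        for pos, idx in enumerate(shuffled):
            bags[pos % num_bags].append(idx)
    return bags


def consistency_loss(bag_probs):
    """Mean squared deviation of pseudo-bag class probabilities from their mean."""
    num_bags = len(bag_probs)
    num_classes = len(bag_probs[0])
    mean = [sum(p[c] for p in bag_probs) / num_bags for c in range(num_classes)]
    total = sum(
        (p[c] - mean[c]) ** 2 for p in bag_probs for c in range(num_classes)
    )
    return total / (num_bags * num_classes)


# Ten patches: six in segment 0, four in segment 1.
bags = split_pseudo_bags([0] * 6 + [1] * 4, num_bags=2)

# Identical predictions incur no penalty; disagreeing predictions do.
loss_same = consistency_loss([[0.9, 0.1], [0.9, 0.1]])
loss_diff = consistency_loss([[0.9, 0.1], [0.1, 0.9]])
```

In training, a term like this would be added to the per-pseudo-bag classification loss, so that pseudo-bags drawn from the same slide are pushed toward agreeing predictions.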
