Key Patches Are All You Need: A Multiple Instance Learning Framework For Robust Medical Diagnosis

Diogo J. Araújo; M. Rita Verdelho; Alceu Bissoto; Jacinto C. Nascimento; Carlos Santiago; Catarina Barata

TL;DR

The paper addresses the susceptibility of deep learning models to dataset biases in medical image analysis by introducing a multiple instance learning (MIL) framework that restricts classification to a small subset of image patches, aligning with clinical ROI-based decision making. It integrates MIL on top of CNN and ViT patch encoders, supporting both instance-level and embedding-level pooling, and evaluates on skin (dermoscopy) and breast (mammography) datasets. Results show that MIL maintains competitive in-domain performance while improving robustness to demographic shifts and enabling more interpretable, patch-level explanations. This approach offers a path toward more reliable, fair, and clinically translatable medical imaging systems without sacrificing accuracy.

Abstract

Deep learning models have revolutionized the field of medical image analysis owing to their outstanding performance. However, they are sensitive to spurious correlations and often exploit dataset bias to improve results on in-domain data, which jeopardizes their generalization capabilities. In this paper, we propose to limit the amount of information these models use to reach the final classification by using a multiple instance learning (MIL) framework. MIL forces the model to use only a (small) subset of patches in the image, identifying discriminative regions. This mimics clinical practice, where medical decisions are based on localized findings. We evaluate our framework on two medical applications: skin cancer diagnosis using dermoscopy and breast cancer diagnosis using mammography. Our results show that using only a subset of the patches does not compromise diagnostic performance on in-domain data compared to the baseline approaches. Moreover, our approach is more robust to shifts in patient demographics and provides more detailed explanations of which regions contributed to the decision. Code is available at: https://github.com/diogojpa99/MedicalMultiple-Instance-Learning.
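
To make the idea of classifying from only a small subset of patches concrete, the following is a minimal sketch of instance-level MIL pooling, written for this summary rather than taken from the authors' repository. It assumes each patch already has a scalar score from some patch classifier; the function names, the example scores, and the rounding rule for $k$ (ceiling, at least one patch) are illustrative assumptions.

```python
import math


def max_pooling(patch_scores):
    """Bag score is the score of the single highest-scoring patch."""
    return max(patch_scores)


def topk_average_pooling(patch_scores, k_fraction=0.125):
    """Bag score is the mean of the k highest patch scores.

    k is derived from a fraction of the bag size; rounding up and keeping
    at least one patch is an assumption of this sketch.
    """
    k = max(1, math.ceil(k_fraction * len(patch_scores)))
    top = sorted(patch_scores, reverse=True)[:k]
    return sum(top) / k


# Hypothetical per-patch melanoma scores for one image (a bag of 8 patches).
scores = [0.05, 0.10, 0.92, 0.08, 0.75, 0.12, 0.03, 0.07]

bag_max = max_pooling(scores)                               # uses 1 patch
bag_topk = topk_average_pooling(scores, k_fraction=0.25)    # uses 2 patches
print(bag_max, bag_topk)
```

In both operators the remaining patches have no influence on the bag prediction, which is the sense in which the model is restricted to a few key patches.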

Paper Structure (24 sections, 6 figures, 6 tables)

Introduction
Related Work
Proposed Approach
Patch Encoder Block
CNN Encoder
ViT Encoder
MIL Block
Instance-level Approach
Embedding-level Approach
Experimental Setup
Datasets
Training Setup
Experimental Results
Binary MIL
Multi-class MIL
...and 9 more sections

Figures (6)

Figure 1: Overview of the proposed approach. An encoder block (CNN or ViT-based) extracts patch representations from the input image. Each patch will be an instance of a bag. Then, a MIL block determines the bag label using an instance or embedding-level approach.
Figure 2: Visualization of the key patches identified by two different MIL approaches. On the left is the instance-level MIL-EN-B3 using max pooling; on the right is the instance-level MIL-EN-B3 using the top-$k$ average pooling operator with $k\approx12.5\%$. The images used for visualization are taken from the $\mathrm{PH}^2$ test set and refer to the binary classification task of melanoma vs. nevus.
Figure 3: Grad-CAM heatmap visualizations generated by the EN-B3 baseline model for images from the $\mathrm{PH}^2$ test set.
Figure 4: Search for the optimal $k$ hyperparameter in the instance-level top-$k$ average MIL pooling operator. We explored three values for the hyperparameter: $k\approx12.5\%$, $k=25\%$, and $k=50\%$. Our experiments were conducted and evaluated on the validation set of the ISIC 2019 dataset, employing different MIL backbones. The backbones included RN-18, RN-50, VGG16, DN-169, EN-B3, DEiT-S, DEiT-cls (DEiT with the CLS token), EViT-S, and EViT-fused (EViT with the fused embedding). Notably, with EViT backbones, $k\approx12.5\%$ resulted in only $9$ patches, $k=25\%$ retained $17$ patches, and $k=50\%$ maintained $34$ patches. These results indicate that using more patches in the bag evaluation does not necessarily lead to better performance.
Figure 5: Performance comparison between EViT-S models with different keep rates. The experiments were conducted on the ISIC 2019 validation dataset, as well as on the two test datasets: $\mathrm{PH}^2$ and Derm7pt, for the binary classification task of melanoma versus nevus. The $x$-axis represents the different $K_r$ values, while the $y$-axis represents the corresponding BA results.
...and 1 more figures
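
The two MIL variants named in the Figure 1 caption can be contrasted with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the linear classifier, its weights `w` and `b`, the toy 2-dimensional patch embeddings, and the choice of max pooling for the instance-level path and mean pooling for the embedding-level path are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def instance_level(bag, w, b):
    """Score every patch embedding, then pool the patch scores (max here).

    Also returns the index of the highest-scoring patch, which is the kind
    of information a key-patch visualization would highlight.
    """
    scores = [sigmoid(dot(x, w) + b) for x in bag]
    key_patch = max(range(len(scores)), key=scores.__getitem__)
    return scores[key_patch], key_patch


def embedding_level(bag, w, b):
    """Pool patch embeddings into one bag embedding (mean here), then classify once."""
    dim = len(bag[0])
    bag_embedding = [sum(x[d] for x in bag) / len(bag) for d in range(dim)]
    return sigmoid(dot(bag_embedding, w) + b)


# Toy bag: 4 patch embeddings, only the second one looks "suspicious".
bag = [[0.1, 0.0], [2.0, 1.5], [0.2, 0.1], [0.0, 0.3]]
w, b = [1.0, 1.0], -1.0  # hypothetical classifier parameters

inst_score, key_patch = instance_level(bag, w, b)
emb_score = embedding_level(bag, w, b)
print(inst_score, key_patch, emb_score)
```

In this toy bag the instance-level path is driven entirely by one patch, while the embedding-level path dilutes it across the whole bag; which behaviour is preferable depends on the pooling operator actually used, which this sketch does not attempt to reproduce.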
