Evaluating Visual Explanations of Attention Maps for Transformer-based Medical Imaging
Minjae Chung, Jong Bum Won, Ganghyun Kim, Yujin Kim, Utku Ozbulak
TL;DR
The paper investigates explainability for Vision Transformer–based medical imaging models and questions the reliability of attention-map explanations. It compares attention maps against GradCAM and the Chefer method across four datasets (CP-Child, DUKE, Kvasir, MURA) using ViT-B/16 models that are either randomly initialized or pretrained with supervised learning, DINO, or MAE, and it evaluates the explanations with pointing-game accuracy and IoU. The findings indicate that GradCAM typically underperforms; attention maps are promising, but transformer-specific methods such as Chefer generally provide more robust explanations, with performance varying by dataset and pretraining. The authors also caution against relying on bounding-box annotations for evaluation and advocate segmentation maps for precise interpretability assessment, offering practical guidance for selecting explainability methods in clinical contexts.
Abstract
Although Vision Transformers (ViTs) have recently demonstrated superior performance in medical imaging problems, they face explainability issues similar to previous architectures such as convolutional neural networks. Recent research efforts suggest that attention maps, which are part of the decision-making process of ViTs, can potentially address the explainability issue by identifying regions influencing predictions, especially in models pretrained with self-supervised learning. In this work, we compare the visual explanations of attention maps to other commonly used methods for medical imaging problems. To do so, we employ four distinct medical imaging datasets that involve the identification of (1) colonic polyps, (2) breast tumors, (3) esophageal inflammation, and (4) bone fractures and hardware implants. Through large-scale experiments on the aforementioned datasets using various supervised and self-supervised pretrained ViTs, we find that although attention maps show promise under certain conditions and generally surpass GradCAM in explainability, they are outperformed by transformer-specific interpretability methods. Our findings indicate that the efficacy of attention maps as a method of interpretability is context-dependent and may be limited, as they do not consistently provide the comprehensive insights required for robust medical decision-making.
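
The TL;DR names pointing-game accuracy and IoU as the two evaluation metrics. As a rough illustration of what these measure, the sketch below scores a single saliency map against a binary annotation mask. It is not the authors' evaluation code: the function names, the min-max normalization, and the 0.5 binarization threshold are assumptions made for this example, and the paper's exact protocol may differ.

```python
import numpy as np


def pointing_game_hit(saliency, mask):
    """Return True if the saliency peak lies inside the annotated region."""
    row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(mask[row, col])


def saliency_iou(saliency, mask, threshold=0.5):
    """IoU between a binarized saliency map and the annotation mask.

    Assumption: the map is min-max normalized to [0, 1] and then
    binarized at `threshold`; other binarization rules are possible.
    """
    saliency = np.asarray(saliency, dtype=float)
    low, high = saliency.min(), saliency.max()
    if high > low:
        normalized = (saliency - low) / (high - low)
    else:
        normalized = np.zeros_like(saliency)  # flat map: nothing highlighted
    predicted = normalized >= threshold
    truth = np.asarray(mask, dtype=bool)
    union = np.logical_or(predicted, truth).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(predicted, truth).sum() / union)


# Toy 4x4 example (made-up values, not data from the paper).
saliency = np.zeros((4, 4))
saliency[1, 1] = 1.0   # peak
saliency[1, 2] = 0.6   # second strongest response
saliency[3, 3] = 0.2   # weak response outside the lesion

mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True  # 2x2 annotated region

hit = pointing_game_hit(saliency, mask)
iou = saliency_iou(saliency, mask)
```

Pointing-game accuracy over a dataset would then be the fraction of images for which `pointing_game_hit` returns True. Because the pointing game only checks the single strongest location, it is far less sensitive to the shape of the annotation than IoU, which is one reason the choice between bounding boxes and segmentation maps matters for the IoU results.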
