Evaluating Visual Explanations of Attention Maps for Transformer-based Medical Imaging

Minjae Chung, Jong Bum Won, Ganghyun Kim, Yujin Kim, Utku Ozbulak

TL;DR

The paper investigates explainability for Vision Transformer-based medical imaging models and questions the reliability of attention-map explanations. It compares attention maps against GradCAM and the Chefer method across four datasets (CP-Child, DUKE, Kvasir, MURA) using ViT-B/16 models that start from random weights or from supervised, DINO, or MAE pretraining, and it evaluates the explanations with pointing-game accuracy and IoU. The findings indicate that GradCAM typically underperforms; attention maps are promising, but transformer-specific methods such as Chefer generally provide more robust explanations, with performance varying by dataset and pretraining. The authors also caution against relying on bounding-box annotations for evaluation and advocate segmentation maps for precise interpretability assessment, offering practical guidance for selecting explainability methods in clinical contexts.

Abstract

Although Vision Transformers (ViTs) have recently demonstrated superior performance in medical imaging problems, they face explainability issues similar to previous architectures such as convolutional neural networks. Recent research efforts suggest that attention maps, which are part of the decision-making process of ViTs, can potentially address the explainability issue by identifying regions influencing predictions, especially in models pretrained with self-supervised learning. In this work, we compare the visual explanations of attention maps to other commonly used methods for medical imaging problems. To do so, we employ four distinct medical imaging datasets that involve the identification of (1) colonic polyps, (2) breast tumors, (3) esophageal inflammation, and (4) bone fractures and hardware implants. Through large-scale experiments on the aforementioned datasets using various supervised and self-supervised pretrained ViTs, we find that although attention maps show promise under certain conditions and generally surpass GradCAM in explainability, they are outperformed by transformer-specific interpretability methods. Our findings indicate that the efficacy of attention maps as a method of interpretability is context-dependent and may be limited, as they do not consistently provide the comprehensive insights required for robust medical decision-making.

Paper Structure

This paper contains 8 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Examples of medical images in (a) CP-Child, (b) DUKE, (c) Kvasir, and (d) MURA datasets. The left images show benign (disease-negative) images, while the right images show malignant (disease-positive) images for each dataset. Red annotation boxes highlight the diseased regions.
  • Figure 2: Visualization of how interpretability maps are evaluated. The right side illustrates the pointing game, identifying the most significant point and checking for a hit against the annotation box (red). The left side illustrates the IoU overlap, including the thresholding operation to create a binary mask and the calculation of IoU.
  • Figure 3: Qualitative examples generated using the interpretability methods on the four datasets employed in this study. Red boxes highlight annotations made by medical experts, whereas blue boxes indicate regions with intensity levels exceeding the top 5%, as identified by the interpretability methods. When the red and blue areas overlap, it indicates a high IoU score, whereas low overlap indicates a low score. When the two boxes do not intersect, it means that the IoU is zero.
  • Figure 4: Box plots showing the IoUs of the evaluated interpretability methods for the datasets employed in this study. Each panel illustrates the distribution of IoU values across four pretraining strategies: Random, Supervised, DINO, and MAE.
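
The two evaluation measures described in the captions of Figures 2 and 3 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the (row_min, col_min, row_max, col_max) box convention, and the choice to compute IoU between the thresholded mask and the filled annotation box are all hypothetical. Only the top-5% threshold and the "most significant point" rule come from the captions.

```python
import numpy as np

def pointing_game_hit(saliency, box):
    """Return True if the highest-valued pixel lies inside the annotation box.

    box is (row_min, col_min, row_max, col_max), with the max edges exclusive
    (an assumed convention, not taken from the paper).
    """
    row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
    r0, c0, r1, c1 = box
    return bool(r0 <= row < r1 and c0 <= col < c1)

def top_percent_mask(saliency, percent=5.0):
    """Binary mask keeping pixels at or above the (100 - percent)th percentile."""
    threshold = np.percentile(saliency, 100.0 - percent)
    return saliency >= threshold

def box_mask(shape, box):
    """Binary mask that is True inside the annotation box."""
    mask = np.zeros(shape, dtype=bool)
    r0, c0, r1, c1 = box
    mask[r0:r1, c0:c1] = True
    return mask

def iou(mask_a, mask_b):
    """Intersection over union of two boolean masks (0.0 if both are empty)."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask_a, mask_b).sum()) / float(union)
```

With a saliency map and an expert box in hand, `pointing_game_hit(saliency, box)` gives the per-image hit, and `iou(top_percent_mask(saliency), box_mask(saliency.shape, box))` gives the per-image overlap. Note that maps with many tied values (for example, large all-zero regions) can make the percentile threshold select more than 5% of the pixels.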