When Medical Imaging Met Self-Attention: A Love Story That Didn't Quite Work Out

Tristan Piater; Niklas Penzel; Gideon Stein; Joachim Denzler

TL;DR

This study assesses whether self-attention mechanisms improve medical image classification on two clinically relevant datasets (ISIC for skin lesions and Camelyon17 for tumor tissue). The authors extend ResNet18 and EfficientNet-B0 with global self-attention (GA), local self-attention (LA), and embedded local self-attention (ELA), compare the resulting models to CNN and ViT baselines, and evaluate the differences statistically after a hyperparameter search. Across in-distribution and out-of-distribution splits, the attention variants show no statistically significant improvement over the fully convolutional baselines; in some cases performance declines or biases emerge. Feature-usage and explainability analyses indicate that self-attention does not reliably learn medically relevant features, and that attention maps offer limited insight beyond Grad-CAM. The work suggests that simply incorporating attention is insufficient for clinical imaging and underscores the need for deeper architectural innovations and rigorous interpretability studies.

Abstract

A substantial body of research has focused on developing systems that assist medical professionals during labor-intensive early screening processes, many based on convolutional deep-learning architectures. Recently, multiple studies have explored the application of so-called self-attention mechanisms in the vision domain. These studies often report empirical improvements over fully convolutional approaches on various datasets and tasks. To evaluate this trend for medical imaging, we extend two widely adopted convolutional architectures with different self-attention variants and train them on two different medical datasets. With this, we aim to specifically evaluate the possible advantages of additional self-attention. We compare our models with similarly sized convolutional and attention-based baselines and evaluate performance gains statistically. Additionally, we investigate how including such layers changes the features learned by these models during training. Following a hyperparameter search, and contrary to our expectations, we observe no significant improvement in balanced accuracy over fully convolutional models. We also find that important features, such as dermoscopic structures in skin lesion images, are still not learned when self-attention is employed. Finally, analyzing local explanations, we confirm biased feature usage. We conclude that merely incorporating attention is insufficient to surpass the performance of existing fully convolutional methods.
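
The abstract reports results in terms of balanced accuracy. As a reminder of what that metric measures, here is a minimal sketch in plain Python, assuming the standard definition (the mean of per-class recall). The function name and the toy labels are hypothetical illustrations and are not taken from the paper's code or data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of its size."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 8 samples of class 0 and 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In this toy case plain accuracy rewards the majority-class predictor with 0.8, while balanced accuracy drops to 0.5, which is presumably why a class-balanced metric suits imbalanced screening data.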
