Evaluating the Explainability of Vision Transformers in Medical Imaging

Leili Barekatain, Ben Glocker

TL;DR

The paper investigates the explainability of Vision Transformer (ViT) architectures in medical imaging by comparing ViT, DeiT, DINO, and Swin on two classification tasks: peripheral blood cell and breast ultrasound images. It applies Gradient Attention Rollout and Grad-CAM and assesses how faithful and how well localized the resulting explanations are with respect to model decisions. Grad-CAM generally outperforms Gradient Attention Rollout, and DINO-ViT paired with Grad-CAM yields the most faithful explanations. ViT and Swin sometimes achieve higher accuracy, but their explanations are more diffuse, which underscores the need to balance predictive performance with interpretability in clinical settings. The study suggests two future directions, ViT-specific explainability methods and the incorporation of domain priors, to improve faithfulness and clinical trust in AI-assisted diagnostics.

Abstract

Understanding model decisions is crucial in medical imaging, where interpretability directly impacts clinical trust and adoption. Vision Transformers (ViTs) have demonstrated state-of-the-art performance in diagnostic imaging; however, their complex attention mechanisms pose challenges to explainability. This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies - ViT, DeiT, DINO, and Swin Transformer - using Gradient Attention Rollout and Grad-CAM. We conduct both quantitative and qualitative analyses on two medical imaging tasks: peripheral blood cell classification and breast ultrasound image classification. Our findings indicate that DINO combined with Grad-CAM offers the most faithful and localized explanations across datasets. Grad-CAM consistently produces class-discriminative and spatially precise heatmaps, while Gradient Attention Rollout yields more scattered activations. Even in misclassification cases, DINO with Grad-CAM highlights clinically relevant morphological features that appear to have misled the model. By improving model transparency, this research supports the reliable and explainable integration of ViTs into critical medical diagnostic workflows.
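
To make the attention-based method named above more concrete, the following is a minimal NumPy sketch of one common formulation of Gradient Attention Rollout. It is an illustration under stated assumptions, not the authors' implementation: this section does not specify the exact variant used. The function name, the input layout (one array of shape heads x tokens x tokens per layer, with the class token at index 0), and the choice to average over heads are all assumptions.

```python
import numpy as np

def gradient_attention_rollout(attentions, gradients):
    """Sketch of gradient-weighted attention rollout (assumed formulation).

    attentions, gradients: lists with one array per transformer layer,
    each of shape (heads, tokens, tokens). Token 0 is assumed to be the
    class token; the remaining tokens are image patches.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for attn, grad in zip(attentions, gradients):
        # Weight attention by its gradient, keep positive evidence only,
        # then average over heads (an assumed fusion choice).
        weighted = np.clip(attn * grad, 0.0, None).mean(axis=0)
        # Add the identity to account for residual connections and
        # renormalize each row so it sums to one.
        weighted = weighted + np.eye(num_tokens)
        weighted = weighted / weighted.sum(axis=-1, keepdims=True)
        # Propagate relevance through the layers.
        rollout = weighted @ rollout
    # Relevance flowing from the class token to each patch token.
    cls_to_patches = rollout[0, 1:]
    return cls_to_patches / (cls_to_patches.max() + 1e-12)
```

The returned vector has one score per patch and can be reshaped to the patch grid and upsampled to image size to obtain a heatmap. Grad-CAM, by contrast, weights feature maps of a chosen layer by class-specific gradients rather than multiplying attention matrices across layers.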

Paper Structure

This paper contains 14 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Visualization of (a,c) inserting and (b,d) deleting the most relevant pixels—identified by Grad-CAM and Gradient Attention Rollout—on the predicted class probability across four Vision Transformer models for the PBC dataset (a,b) and Breast Ultrasound dataset (c,d).
  • Figure 4: Grad-CAM visualizations of misclassified images by the DINO-ViT model. Top: PBC dataset — ground truth is Monocyte, but predicted as Immature Granulocyte with 67.43% confidence. Bottom: Breast ultrasound — ground truth is Benign, but predicted as Malignant with 79.17% confidence.
  • Three further figures (numbers missing in the source), each with panel (a) Gradient Attention Rollout
  • ...and 3 more figures
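
The deletion experiment described in the Figure 1 caption can be sketched as follows. This is a hedged illustration, not the paper's evaluation code: `deletion_curve`, `predict_fn`, the single-channel image, the zero baseline, and the number of steps are all assumptions chosen for brevity. The idea is to remove pixels in order of decreasing relevance and record the predicted class probability after each step; a faithful explanation should make the probability fall quickly.

```python
import numpy as np

def deletion_curve(predict_fn, image, saliency, steps=10, baseline=0.0):
    """Sketch of a deletion curve for a 2D image (assumed setup).

    predict_fn: hypothetical callable mapping an image array to the
        predicted class probability (a float).
    saliency: relevance map with the same shape as `image`.
    """
    # Pixel indices sorted from most to least relevant.
    order = np.argsort(saliency.ravel())[::-1]
    flat = image.astype(float).ravel().copy()
    chunk = int(np.ceil(order.size / steps))
    # Probability on the unmodified image comes first.
    probs = [predict_fn(flat.reshape(image.shape))]
    for start in range(0, order.size, chunk):
        # Replace the next most relevant pixels with the baseline value.
        flat[order[start:start + chunk]] = baseline
        probs.append(predict_fn(flat.reshape(image.shape)))
    return np.array(probs)
```

The insertion variant mirrors this procedure: start from an all-baseline image and restore the most relevant pixels first, expecting the probability to rise quickly. In either case, the area under the resulting curve gives a single score for comparing explanation methods.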