Hierarchical Vision Transformer with Prototypes for Interpretable Medical Image Classification

Luisa Gallée, Catharina Silvia Lisson, Meinrad Beer, Michael Götz

TL;DR

The paper addresses the need for explainable AI in medical image classification where attention alone is insufficient. It introduces HierViT, a 12-layer Vision Transformer augmented with a hierarchical attribute predictor, a target predictor, and an optional segmentation head, all grounded by prototype-based exemplars and attention heatmaps to align decisions with human-defined features. Training combines loss terms $L_{tar}$, $L_{attr}$, $L_{seg}$, and a prototype loss $L_{proto}$ weighted at $0.01$, with prototypes populated via a push mechanism after a warm-up. On LIDC-IDRI, HierViT achieves state-of-the-art Within-1-Accuracy, and on derm7pt it remains competitive with strong target and attribute performance, while providing richer explanations that support clinician validation and trust. These contributions offer a path toward intrinsically interpretable high-performance ViTs in medicine, though limitations include annotation scarcity and opportunities for multimodal extensions and data synthesis.
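The loss combination described above can be sketched as a weighted sum. This is a minimal illustration, not the authors' implementation: only the prototype weight of $0.01$ comes from the summary; the function name `combined_loss` and the unit weights on the other three terms are assumptions.

```python
def combined_loss(l_tar, l_attr, l_proto, l_seg=None,
                  w_tar=1.0, w_attr=1.0, w_seg=1.0, w_proto=0.01):
    """Weighted sum of the four loss terms named in the summary.

    Only w_proto = 0.01 is stated in the source. The weights of 1.0 on the
    target, attribute, and segmentation terms are assumptions.
    """
    total = w_tar * l_tar + w_attr * l_attr + w_proto * l_proto
    if l_seg is not None:
        # The segmentation head is optional, so its loss term is optional too.
        total += w_seg * l_seg
    return total


if __name__ == "__main__":
    # Hypothetical per-batch loss values, for illustration only.
    print(combined_loss(l_tar=0.8, l_attr=1.2, l_proto=5.0, l_seg=0.3))
```

The small prototype weight keeps that term from dominating the prediction losses; whether the remaining terms are weighted equally is not stated in this summary.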

Abstract

Explainability is a highly demanded requirement for applications in high-risk areas such as medicine. Vision Transformers have mainly been limited to attention extraction to provide insight into the model's reasoning. Our approach combines the high performance of Vision Transformers with the introduction of new explainability capabilities. We present HierViT, a Vision Transformer that is inherently interpretable and adapts its reasoning to that of humans. A hierarchical structure is used to process domain-specific features for prediction. It is interpretable by design, as it derives the target output with human-defined features that are visualized by exemplary images (prototypes). By incorporating domain knowledge about these decisive features, the reasoning is semantically similar to human reasoning and therefore intuitive. Moreover, attention heatmaps visualize the crucial regions for identifying each feature, thereby providing HierViT with a versatile tool for validating predictions. Evaluated on two medical benchmark datasets, LIDC-IDRI for lung nodule assessment and derm7pt for skin lesion classification, HierViT achieves superior and comparable prediction accuracy, respectively, while offering explanations that align with human reasoning.
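The LIDC-IDRI result is reported as Within-1-Accuracy. A minimal sketch of that metric follows, assuming its usual meaning for ordinal ratings: a prediction counts as correct when it differs from the reference score by at most one. The function name and the sample scores are hypothetical.

```python
def within_one_accuracy(predictions, targets):
    """Fraction of predictions within one score of the reference.

    Assumes ordinal integer ratings (for example a 1-5 scale); the exact
    protocol used in the paper is not given in this summary.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    hits = sum(1 for p, t in zip(predictions, targets) if abs(p - t) <= 1)
    return hits / len(predictions)


if __name__ == "__main__":
    # Hypothetical predicted and reference scores.
    print(within_one_accuracy([1, 3, 5, 2], [2, 3, 3, 4]))
```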

Paper Structure

This paper contains 12 sections, 7 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Proposed model. The patchified image is linearly projected and processed by a transformer encoder, producing a token vector that serves as the input for both a hierarchical classifier and a decoder. The hierarchical classifier processes the token vector through multiple transformer layers, one for each attribute, with individual heads providing attribute ratings. For target prediction, the token vectors from the attribute layers are stacked and further processed by the target branch. The optional decoder segments a region of interest mask.
  • Figure 2: Reasoning process. Three sample cases are illustrated: (a) correctly predicted, (b) and (c) incorrectly predicted. For three of the eight attributes (spiculation, sphericity, lobulation), the score, attention heatmap, and prototype image of the respective attribute are displayed.
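The prototype lookup behind the exemplar images in Figure 2 can be sketched as a nearest-neighbour search in the latent space. This is an illustration under stated assumptions, not the paper's implementation: Euclidean distance, the function name `nearest_prototype`, and the toy vectors are all hypothetical.

```python
import math


def nearest_prototype(attribute_token, prototypes):
    """Return (index, distance) of the prototype closest to the token.

    `attribute_token` is the latent vector of one attribute branch and
    `prototypes` is a list of stored prototype vectors for that attribute.
    Euclidean distance is an assumption; the summary does not name the
    distance measure.
    """
    if not prototypes:
        raise ValueError("at least one prototype is required")
    best_index, best_distance = 0, math.inf
    for index, prototype in enumerate(prototypes):
        distance = math.dist(attribute_token, prototype)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


if __name__ == "__main__":
    # Toy 3-dimensional latent vectors, for illustration only.
    spiculation_prototypes = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [3.0, 0.0, 4.0]]
    token = [0.9, 1.1, 1.0]
    index, distance = nearest_prototype(token, spiculation_prototypes)
    print(index, round(distance, 3))
```

The returned index would then select the stored exemplar image shown next to the attribute score and attention heatmap.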