ViConEx-Med: Visual Concept Explainability via Multi-Concept Token Transformer for Medical Image Analysis

Cristiano Patrício, Luís F. Teixeira, João C. Neves

TL;DR

ViConEx-Med is a novel transformer-based framework for visual concept explainability that introduces multi-concept learnable tokens to jointly predict and localize visual concepts, suggesting a promising direction for building inherently interpretable models grounded in visual concepts.

Abstract

Concept-based models aim to explain model decisions with human-understandable concepts. However, most existing approaches treat concepts as numerical attributes, without providing complementary visual explanations that could localize the predicted concepts. This limits their utility in real-world applications and particularly in high-stakes scenarios, such as medical use-cases. This paper proposes ViConEx-Med, a novel transformer-based framework for visual concept explainability, which introduces multi-concept learnable tokens to jointly predict and localize visual concepts. By leveraging specialized attention layers for processing visual and text-based concept tokens, our method produces concept-level localization maps while maintaining high predictive accuracy. Experiments on both synthetic and real-world medical datasets demonstrate that ViConEx-Med outperforms prior concept-based models and achieves competitive performance with black-box models in terms of both concept detection and localization precision. Our results suggest a promising direction for building inherently interpretable models grounded in visual concepts. Code is publicly available at https://github.com/CristianoPatricio/viconex-med.

Paper Structure

This paper contains 28 sections, 10 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: (a) Concept Bottleneck Models (CBMs) provide concept predictions but lack spatially-grounded visual explanations. (b) ViConEx-Med produces faithful visual explanations for predicted concepts, enabling human-in-the-loop decision-making and fostering trust in clinical practice.
  • Figure 2: Illustration of the proposed ViConEx-Med. (a) Multi-Concept Token Transformer encoder with specialized layers for processing multiple learnable visual concept tokens alongside complementary text-based concept tokens. (b) Text-guided concept enhancement module, leveraging domain knowledge from a medical foundation model to provide complementary semantic information that guides the visual concept tokens. (c) Contrastive concept token regularization applied to the output visual concept tokens to promote discriminativeness between visual tokens. (d) At inference time, the visual concept localization maps are generated by fusing the multi-modal concept tokens with Patch CAM, followed by refinement using the pairwise affinity matrix derived from patch-to-patch attention.
  • Figure 3: The CAM module. The output patch tokens from the last layer of the transformer encoder are reshaped and then passed through a convolutional layer to reduce the depth to match the number of concepts. The resulting features are processed via Global Max Pooling to produce the concept scores.
  • Figure 4: Overview of the synthetic dataset generation pipeline. Step 1: A lesion mask and a randomly selected dermoscopic attribute mask are sampled from real skin datasets to serve as structural priors for the synthetic image. Step 2: A background texture is generated using two consecutive sampled skin tones. Step 3: A color combination is drawn from the color bank: the first color defines the lesion background color, while the remaining colors are applied to the attribute regions. The output includes the generated synthetic image along with the lesion mask, border mask and color masks.
  • Figure 5: Statistics for the SynSkin dataset. (a) Color distribution. (b) Examples of synthetic images. (c) Statistics for SynSkin.
  • ...and 3 more figures
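
The CAM module described in the Figure 3 caption (reshape the patch tokens, reduce the depth to the number of concepts with a convolution, then apply Global Max Pooling) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not give the kernel size, so a 1x1 convolution is assumed here (equivalent to applying one linear map at every patch location), and the names `cam_head`, `weight`, `bias` and `grid_hw` are hypothetical.

```python
import numpy as np


def cam_head(patch_tokens, weight, bias, grid_hw):
    """Sketch of a patch-token CAM head (assumed 1x1 convolution).

    patch_tokens: (N, D) output patch tokens from the last encoder layer
    weight:       (C, D) assumed 1x1 conv kernel, one row per concept
    bias:         (C,)   assumed conv bias
    grid_hw:      (H, W) patch grid size, with H * W == N
    Returns the per-concept maps (H, W, C) and the concept scores (C,).
    """
    h, w = grid_hw
    n, d = patch_tokens.shape
    if n != h * w:
        raise ValueError("number of patch tokens must equal H * W")

    # Reshape the token sequence back into a spatial feature map.
    feature_map = patch_tokens.reshape(h, w, d)

    # A 1x1 convolution applies the same linear map at every location,
    # reducing the depth from D to the number of concepts C.
    concept_maps = feature_map @ weight.T + bias

    # Global Max Pooling over the spatial grid gives one score per concept.
    concept_scores = concept_maps.max(axis=(0, 1))
    return concept_maps, concept_scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(14 * 14, 384))  # hypothetical 14x14 grid, D=384
    w_conv = rng.normal(size=(5, 384))        # hypothetical 5 concepts
    maps, scores = cam_head(tokens, w_conv, np.zeros(5), (14, 14))
    print(maps.shape, scores.shape)
```

Because max pooling selects the single strongest location for each concept, each concept score is tied directly to one patch of the corresponding map, which is what lets the same tensor serve as both a prediction and a coarse localization.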