Explainable Image Recognition via Enhanced Slot-attention Based Classifier

Bowen Wang, Liangzhi Li, Jiahao Zhang, Yuta Nakashima, Hajime Nagahara

TL;DR

E-SCOUTER presents an intrinsically explainable image classifier built on a modified slot-attention architecture that embeds explanations directly into decision scores. By supporting both positive and negative explanations and introducing an area loss to constrain explanation regions, it achieves strong interpretability without sacrificing competitive accuracy across diverse datasets, including ImageNet and medical imaging tasks. The normalization step enables scalability to large category sets, and case studies illustrate clinically meaningful, fine-grained explanations that align with expert annotations. Overall, the approach delivers state-of-the-art interpretability on multiple XAI metrics while maintaining robust classification performance, offering practical benefits for high-stakes visual analysis and AI-assisted diagnostics.

Abstract

Understanding the behavior of deep learning models is of utmost importance. Explainable Artificial Intelligence (XAI) has emerged as a promising avenue toward this goal and has attracted increasing interest in recent years. However, most existing methods depend on gradients or input perturbation, and so often fail to embed explanations directly within the model's decision-making process. To address this gap, we introduce E-SCOUTER, a visually explainable classifier based on a modified slot attention mechanism. E-SCOUTER not only delivers high classification accuracy but also offers more transparent insight into the reasoning behind its decisions. It differs from prior approaches in two significant aspects: (a) E-SCOUTER incorporates explanations into the final confidence scores for each category, providing a more intuitive interpretation, and (b) it offers positive or negative explanations for all categories, elucidating "why an image belongs to a certain category" or "why it does not." A novel loss function designed specifically for E-SCOUTER fine-tunes the model's behavior, enabling it to switch between positive and negative explanations. An area loss is also designed to adjust the size of the explanatory regions for a more precise explanation. Our method, tested across various datasets and XAI metrics, outperformed previous state-of-the-art methods, supporting its effectiveness as an explanatory tool.
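
The two ideas named in the abstract, confidence scores built directly from the explanation maps and an area penalty on those maps, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the averaging used as normalization, the sign convention for positive versus negative explanations, and the example logits are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def class_scores(logits, sign=1):
    """Turn per-category attention logits into explanation maps and scores.

    logits: one list of per-location values for each category (hypothetical input).
    sign:   +1 for positive explanations, -1 for negative ones (assumed convention).
    """
    # Each category gets an attention map with values in (0, 1).
    maps = [[sigmoid(v) for v in row] for row in logits]
    # The score of a category is the signed, averaged mass of its own map,
    # so the map that explains the decision is also what produces it.
    scores = [sign * sum(m) / len(m) for m in maps]
    return maps, scores

def area_loss(maps):
    """Mean attention over all categories and locations; penalizing this
    value pushes the explanatory regions to be smaller."""
    total = sum(sum(m) for m in maps)
    count = sum(len(m) for m in maps)
    return total / count

# Toy input: 2 categories, 4 spatial locations.
toy_logits = [[4.0, 3.0, -5.0, -5.0],
              [-5.0, -5.0, -4.0, -5.0]]
maps, pos_scores = class_scores(toy_logits, sign=+1)
_, neg_scores = class_scores(toy_logits, sign=-1)
predicted = max(range(len(pos_scores)), key=lambda i: pos_scores[i])
penalty = area_loss(maps)
```

In this sketch the first category wins under the positive convention because its map covers more of the image, and the area penalty would be added to a classification loss with a weight such as the $\lambda$ that appears in the figure captions.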

Paper Structure (23 sections, 17 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Explanations from E-SCOUTER. The positive ($+$) and negative ($-$) E-SCOUTER losses emphasize the positive and negative supports, respectively, from which one can understand why an image is or is not classified into a certain category.
  • Figure 2: Classification pipeline. (a) E-SCOUTER as a classifier. (b) The overview of E-SCOUTER for classification, where PE is position embedding, RE is reshape operation, $\sigma$ is sigmoid activation, and ($\cdot$) denotes dot multiplication.
  • Figure 3: Classification performance of different models with FC classifier, E-SCOUTER$^{+}$ ($\lambda=10$), and E-SCOUTER$^{-}$ ($\lambda=10$). The horizontal axis is the number of categories, where the first $n$ categories of the ImageNet dataset were used; the vertical axis is the accuracy.
  • Figure 4: Positive supports for MNIST, using ResNet-18 (He et al., 2016) as the backbone and $\lambda=10$.
  • Figure 5: Negative supports for MNIST, using ResNet-18 (He et al., 2016) as the backbone and $\lambda=10$.
  • ...and 7 more figures