SCAN: Visual Explanations with Self-Confidence and Analysis Networks

Gwanghee Lee, Sungyoon Jeong, Kyoungson Jhang

TL;DR

By providing a unified framework that is both architecturally universal and highly faithful, SCAN enhances model transparency and offers a more reliable tool for understanding the decision-making processes of complex neural networks.

Abstract

Explainable AI (XAI) has become essential in computer vision to make the decision-making processes of deep learning models transparent. However, current visual explanation methods face a critical trade-off between the high fidelity of architecture-specific methods and the broad applicability of universal ones. This often results in abstract or fragmented explanations and makes it difficult to compare explanatory power across diverse model families, such as CNNs and Transformers. This paper introduces the Self-Confidence and Analysis Networks (SCAN), a novel universal framework that overcomes these limitations for both CNN and Transformer architectures. SCAN uses an AutoEncoder-based approach to reconstruct features from a model's intermediate layers. Guided by the Information Bottleneck principle, it generates a high-resolution Self-Confidence Map that identifies information-rich regions. Extensive experiments on diverse architectures and datasets demonstrate that SCAN consistently achieves outstanding performance on quantitative metrics such as AUC-D, Negative AUC, Drop%, and Win%. Qualitatively, it produces significantly clearer, object-focused explanations than existing methods. By providing a unified framework that is both architecturally universal and highly faithful, SCAN enhances model transparency and offers a more reliable tool for understanding the decision-making processes of complex neural networks.
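
The abstract names deletion-style faithfulness metrics without defining them on this page. As a rough illustration only, the sketch below computes a generic deletion curve and its area: pixels are removed in order of decreasing saliency and the model score is tracked as they disappear. This is a common formulation and is an assumption here; it is not necessarily the paper's exact AUC-D definition. The names `deletion_curve`, `deletion_auc`, and `toy_score` are hypothetical, and `toy_score` is a stand-in for a real classifier's confidence.

```python
import numpy as np


def deletion_curve(image, saliency, score_fn, steps=4, baseline=0.0):
    """Remove pixels from most to least salient and record the score.

    Assumption: a generic deletion procedure, not the paper's own protocol.
    """
    n = saliency.size
    order = np.argsort(saliency.ravel())[::-1]        # most salient first
    work = np.array(image, dtype=float, order="C")    # contiguous copy
    flat = work.reshape(n, -1)                        # view onto `work`
    fractions, scores = [0.0], [score_fn(work)]
    chunk = int(np.ceil(n / steps))
    for start in range(0, n, chunk):
        flat[order[start:start + chunk]] = baseline   # erase this chunk
        fractions.append(min(start + chunk, n) / n)   # fraction removed
        scores.append(score_fn(work))
    return np.asarray(fractions), np.asarray(scores)


def deletion_auc(fractions, scores):
    """Trapezoidal area under the score-versus-fraction-removed curve."""
    widths = np.diff(fractions)
    heights = (scores[:-1] + scores[1:]) / 2.0
    return float(np.sum(widths * heights))


def toy_score(x):
    """Hypothetical 'model': confidence depends only on the top-left 2x2."""
    return float(x[:2, :2].mean())


image = np.ones((4, 4))

good = np.zeros((4, 4))
good[:2, :2] = 1.0          # highlights the region the toy model uses
bad = 1.0 - good            # highlights everything else

auc_good = deletion_auc(*deletion_curve(image, good, toy_score))
auc_bad = deletion_auc(*deletion_curve(image, bad, toy_score))
print(auc_good, auc_bad)
```

Under this formulation a lower area indicates a more faithful map, because the score collapses as soon as the highlighted pixels are removed. In the toy case the map that covers the region the scorer depends on yields 0.125, while the inverted map yields 0.875.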

Paper Structure

This paper contains 33 sections, 10 equations, 9 figures, and 9 tables.

Figures (9)

  • Figure 1: SCAN process. Feature maps are extracted from the target model and reconstructed, and then important regions containing significant information are visualized using the self-confidence map.
  • Figure 2: Analysis networks for CNN and transformer models. The ResNet-based decoder is optimized for CNN model structures, while the transformer-based decoder is designed for transformer model structures.
  • Figure 3: Qualitative comparison of visual explanation methods for a ViT-b16 model trained on ImageNet. Compared to baselines such as Raw Attention, Rollout, and others, SCAN generates a more coherent and object-focused explanation.
  • Figure 4: Qualitative comparison of SCAN and other methods on ResNet50V2. While conventional methods generate abstract saliency maps, SCAN produces more distinct explanations with clear object boundaries.
  • Figure 5: Qualitative results across various models. SCAN consistently generates clear and object-focused explanations.
  • ...and 4 more figures