WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts

Yong Hyun Ahn; Hyeon Bae Kim; Seong Tae Kim

TL;DR

WWW presents a unified, plug-and-play framework for neural network interpretability that simultaneously explains what concepts neurons represent, where those concepts appear in the input, and why decisions are made, using Adaptive Cosine Similarity (ACS) for concept discovery, Neuron Activation Maps (NAMs) for localization, and Shapley-value-based reasoning. The approach yields class- and sample-level explanations with localized concept maps and heatmaps, enables uncertainty estimation via heatmap similarity, and provides robust, interpretable insights across CNNs and Vision Transformers. Key contributions include (i) a concept discovery module that identifies major and minor neuron concepts, (ii) a localization module that produces targeted concept heatmaps, and (iii) a reasoning module that uses class-wise Shapley contributions to explain and validate predictions, all in a model-agnostic, plug-and-play fashion. Empirical results show that WWW outperforms baselines in both qualitative interpretability and quantitative metrics, while ablations demonstrate the importance of ACS and of carefully tuned concept sensitivity, supporting its practical value for trustworthy AI across diverse architectures.
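A minimal sketch of the concept-scoring step, following the description in the Figure 1 caption: ACS subtracts each image's similarity to the bare base template from its similarity to the concept-filled template, and averaging over images gives one score per concept. It assumes precomputed CLIP embeddings passed in as NumPy arrays (toy vectors stand in for them here). The rule in `adaptive_select`, which keeps concepts scoring at least a fraction `alpha` of the way from the minimum to the maximum score, is a hypothetical stand-in for the paper's adaptive selection with concept sensitivity $\alpha$, not its actual definition; all function names are illustrative.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def adaptive_cosine_similarity(img_feats, concept_feats, base_feat):
    """ACS as sketched here: sim(image, template+concept) - sim(image, base template).

    img_feats:     (n, d) image embeddings of a neuron's high-activating images
    concept_feats: (m, d) text embeddings of the template filled with each concept
    base_feat:     (d,)   text embedding of the base template alone
    Returns an (n, m) matrix.
    """
    img = l2_normalize(img_feats)
    sim_concept = img @ l2_normalize(concept_feats).T  # (n, m) cosine similarities
    sim_base = img @ l2_normalize(base_feat)           # (n,) similarity to the bare template
    return sim_concept - sim_base[:, None]             # remove the template-only part


def concept_scores(acs):
    """Concept score S = {s_1, ..., s_m}: mean ACS over the images."""
    return acs.mean(axis=0)


def adaptive_select(scores, alpha):
    """Hypothetical rule: keep concepts scoring at least min + alpha * (max - min),
    returned in descending order of score."""
    threshold = scores.min() + alpha * (scores.max() - scores.min())
    keep = np.flatnonzero(scores >= threshold)
    return keep[np.argsort(-scores[keep])]
```

Subtracting the base-template similarity means a text embedding that carries only the template wording scores zero: if a "concept" embedding coincides with the base template, its ACS column is exactly zero, so only concept-specific similarity contributes to $S$.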

Abstract

Recent advancements in neural networks have showcased their remarkable capabilities across various domains. Despite these successes, the "black box" problem remains. To address it, we propose a novel framework, WWW, that explains the 'what', 'where', and 'why' of neural network decisions in human-understandable terms. Specifically, WWW uses adaptive selection for concept discovery, employing adaptive cosine similarity and thresholding to explain 'what'. To address 'where' and 'why', we propose a novel combination of neuron activation maps (NAMs) with Shapley values, generating localized concept maps and heatmaps for individual inputs. Furthermore, WWW introduces a method for predicting uncertainty that leverages heatmap similarities to estimate 'how' reliable a prediction is. Experimental evaluations show that WWW outperforms existing interpretability methods on both quantitative and qualitative metrics. WWW provides a unified solution for explaining 'what', 'where', and 'why', introduces a way to derive localized explanations from global interpretations, and offers a plug-and-play solution adaptable to various architectures.
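To make the 'why' step concrete, here is a minimal sketch of estimating each neuron's Shapley contribution to a class score by Monte Carlo permutation sampling. The value function (keep a subset of neurons, replace the rest with a zero baseline, then apply a linear class head) is an assumption for illustration, and the estimator and the names `shapley_permutation`, `top_k_neurons`, and `class_score` are hypothetical, not taken from the paper. Averaging per-sample estimates over the samples of one class would give a class-level vector analogous to the pre-computed class Shapley values, again as an assumption.

```python
import numpy as np


def shapley_permutation(value_fn, n_neurons, n_perms=100, seed=0):
    """Monte Carlo permutation estimate of per-neuron Shapley values.

    value_fn(mask) must return the class score when only neurons with
    mask[j] == True are kept (the others set to a baseline).
    """
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_neurons)
    for _ in range(n_perms):
        mask = np.zeros(n_neurons, dtype=bool)
        prev = value_fn(mask)                 # score of the empty coalition
        for j in rng.permutation(n_neurons):
            mask[j] = True                    # add neuron j to the coalition
            cur = value_fn(mask)
            phi[j] += cur - prev              # marginal contribution of neuron j
            prev = cur
    return phi / n_perms


def top_k_neurons(phi, k):
    """Indices of the k neurons with the largest Shapley values."""
    return np.argsort(-phi)[:k]


# Toy setting (hypothetical): pooled activations `a` feed a linear head `w`
# for one class; dropped neurons are replaced by a zero baseline.
a = np.array([2.0, 0.5, 1.0, 0.0])
w = np.array([0.3, -1.0, 0.8, 2.0])


def class_score(mask):
    return float(w @ (a * mask))
```

For a linear head every marginal contribution is constant, so the estimate equals `w * a` (approximately `[0.6, -0.5, 0.8, 0.0]` here) regardless of the sampled permutations, and the two most important neurons are 2 and 0. With a real network the value function is non-linear and more permutations are needed for a stable estimate.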

Paper Structure (32 sections, 4 equations, 14 figures, 7 tables)

Introduction
Related Works
Neuron-Concept Association
Vector-based Explanation
Method
Method Overview
Concept Discovery Module
Localization Module
Reasoning Module
Overall flow and Output of WWW
Experiment
Performance Evaluation for Concept Module
Qualitative Results
Quantitative Results
Ablation study
...and 17 more sections

Figures (14)

Figure 1: Overall flow of the Concept Discovery module identifying concepts for a single neuron. From the selected high-activating images, we first compute the cosine similarity between the CLIP image features and the CLIP text features of each concept inserted into a template. We then subtract the cosine similarity between the image features and the base template alone, so that only the similarity between the concept and the image is retained. From the resulting adaptive cosine similarity (ACS), we compute the concept score $S$ by averaging over the images. Note that the concept score $S=\{s_1,s_2,...,s_m\}$ is a set of scores, not a single scalar. From the concept scores $S$, we select major concepts by adaptive selection. Minor concepts are discovered with the same process, applied to cropped images.
Figure 2: Illustration of the overall test-time flow of WWW. At test time (i.e., inference), the class explanation selects important neurons using the pre-computed Shapley values of the predicted class, whereas the sample explanation selects important neurons using Shapley values computed for the input sample. Pre-computed concepts are then annotated to the selected neurons. After concept annotation, a class heatmap is generated with the pre-computed Shapley values of the predicted class, and a sample heatmap is generated with the Shapley values of the input sample.
Figure 3: Qualitative comparison of WWW with other baselines. We compare WWW with three competitive baselines (CLIP-Dissect [oikarinen2023clipdissect], MILAN [MILAN], FALCON [kalibhat2023identifying]) on two final-layer neurons and four penultimate-layer (i.e., layer 4) neurons, each shown with its highly activating images. The layer-4 neurons are the top-$2$ most important neurons for each final-layer class. Descriptions are colored green if they match the images, yellow if they match but are too generic or similar, and red if they do not match.
Figure 4: Ablation of concept sensitivity and feasibility of heatmap similarity. The left panel shows the F1 score as concept sensitivity ($\alpha$) varies; the value of $\alpha$ that maximizes the F1 score is marked by the red line. The right panel shows the rejection test results for heatmap similarity and maximum softmax probability (MSP). "# of hit" denotes the number of mispredicted samples that are correctly detected as mispredictions.
Figure 5: Example of an explanation generated by WWW. From top to bottom, important neurons are displayed in order of importance. The images for each neuron are examples of its major and minor concepts, respectively. Colors in the top localization image show the regions most related to each concept. The bottom localization image is a weighted sum of the important neurons' activation maps, displayed as a heatmap.
...and 9 more figures
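The class and sample heatmaps described in the Figure 2 and Figure 5 captions, and the heatmap-similarity check evaluated in Figure 4, can be sketched as below. Each heatmap is a Shapley-weighted sum of the top-k neuron activation maps of one input; the class heatmap uses class-level weights and the sample heatmap uses sample-level weights. Measuring their agreement with the cosine similarity of the flattened maps and flagging a prediction when it falls below a threshold `tau` are assumptions for illustration, as are the names `shapley_heatmap`, `heatmap_similarity`, and `flag_misprediction`; the paper's exact similarity measure and threshold are not given here.

```python
import numpy as np


def shapley_heatmap(nams, phi, k):
    """Weighted sum of the top-k neuron activation maps of one input.

    nams: (n_neurons, H, W) activation maps; phi: (n_neurons,) Shapley weights.
    """
    idx = np.argsort(-phi)[:k]                     # most important neurons
    return np.tensordot(phi[idx], nams[idx], axes=1)  # (H, W)


def heatmap_similarity(h1, h2, eps=1e-12):
    """Cosine similarity between two flattened heatmaps (assumed measure)."""
    a, b = h1.ravel(), h2.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def flag_misprediction(h_class, h_sample, tau):
    """Flag the prediction as unreliable when the heatmaps disagree."""
    return heatmap_similarity(h_class, h_sample) < tau
```

A prediction whose sample heatmap highlights different regions from the class heatmap of its predicted class receives a low similarity and is flagged; ranking samples by this similarity is the kind of score the rejection test in Figure 4 compares against MSP.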
