WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts
Yong Hyun Ahn, Hyeon Bae Kim, Seong Tae Kim
TL;DR
WWW is a unified, plug-and-play framework for neural network interpretability that explains what concepts neurons represent, where those concepts appear in the input, and why a decision is made. It uses Adaptive Cosine Similarity (ACS) for concept discovery, Neuron Activation Maps (NAMs) for localization, and Shapley-value-based reasoning. The approach yields class-level and sample-level explanations with localized concept maps and heatmaps, supports uncertainty estimation via heatmap similarity, and applies to both CNNs and Vision Transformers. Key contributions include (i) a concept discovery module that identifies major and minor neuron concepts, (ii) a localization module that produces targeted concept heatmaps, and (iii) a reasoning module that uses class-wise Shapley contributions to explain and validate predictions, all in a model-agnostic, plug-and-play fashion. Empirical results show that WWW outperforms baselines in both qualitative interpretability and quantitative metrics, while ablations demonstrate the importance of ACS and of carefully tuned concept sensitivity.
Abstract
Recent advancements in neural networks have showcased their remarkable capabilities across various domains. Despite these successes, the "black box" problem remains. To address this, we propose a novel framework, WWW, that offers the 'what', 'where', and 'why' of neural network decisions in human-understandable terms. Specifically, WWW utilizes adaptive selection for concept discovery, employing adaptive cosine similarity and thresholding techniques to effectively explain 'what'. To address the 'where' and 'why', we propose a novel combination of neuron activation maps (NAMs) with Shapley values, generating localized concept maps and heatmaps for individual inputs. Furthermore, WWW introduces a method for predicting uncertainty, leveraging heatmap similarities to estimate 'how' reliable the prediction is. Experimental evaluations of WWW demonstrate superior performance in both quantitative and qualitative metrics, outperforming existing methods in interpretability. WWW provides a unified solution for explaining 'what', 'where', and 'why', introducing a method for localized explanations from global interpretations and offering a plug-and-play solution adaptable to various architectures.
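The idea of using heatmap similarity as a reliability signal can be illustrated with a minimal sketch. The abstract does not state which heatmaps are compared or which similarity measure is used, so everything here is an assumption made for illustration: cosine similarity is chosen as the measure, the two inputs are taken to be two same-sized heatmaps for one input, and the function names (`cosine_similarity`, `flatten`, `reliability_score`) are hypothetical rather than part of WWW.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("heatmaps must have the same number of pixels")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no attribution mass in one map: treat as no agreement
    return dot / (norm_a * norm_b)


def flatten(heatmap: Sequence[Sequence[float]]) -> List[float]:
    """Flatten a 2D heatmap (list of rows) into a single list of pixel values."""
    return [value for row in heatmap for value in row]


def reliability_score(heatmap_a: Sequence[Sequence[float]],
                      heatmap_b: Sequence[Sequence[float]]) -> float:
    """Assumed proxy: higher similarity between the two heatmaps is read as
    a more reliable prediction. This is an illustration, not the paper's formula."""
    return cosine_similarity(flatten(heatmap_a), flatten(heatmap_b))


if __name__ == "__main__":
    # Toy 2x2 heatmaps: the first pair highlights the same regions,
    # the second pair highlights disjoint regions.
    agree = reliability_score([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
    disagree = reliability_score([[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]])
    print(f"agreeing heatmaps:    {agree:.3f}")
    print(f"disagreeing heatmaps: {disagree:.3f}")
```

Under these assumptions, a score near 1 means the two heatmaps highlight the same regions, and a score near 0 means they disagree, which would be read as a less reliable prediction.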
