CoSy: Evaluating Textual Explanations of Neurons
Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina M.-C. Höhne, Kirill Bykov
TL;DR
CoSy presents the first automatic framework for quantitatively evaluating open-vocabulary textual explanations of neurons. It translates each explanation into synthetic images and compares the neuron's activations on those images against its activations on a control distribution. The method has three steps (generate, measure, and score) and uses two scores, the area under the ROC curve (AUC) and the mean activation difference (MAD), to assess how well an explanation aligns with the neuron's behavior across architectures and datasets. Extensive sanity checks and cross-method benchmarking reveal substantial variability in explanation quality: concepts in higher layers are generally better explained, and INVERT and CLIP-Dissect often outperform MILAN and FALCON. The approach offers a scalable, architecture-agnostic way to benchmark explanation methods, and its results call for caution when interpreting explanations of lower-layer neurons and of abstract concepts.
Abstract
A crucial aspect of understanding the complex nature of Deep Neural Networks (DNNs) is the ability to explain learned concepts within their latent representations. While methods exist to connect neurons to human-understandable textual descriptions, evaluating the quality of these explanations is challenging due to the lack of a unified quantitative approach. We introduce CoSy (Concept Synthesis), a novel, architecture-agnostic framework for evaluating textual explanations of latent neurons. Given textual explanations, our proposed framework uses a generative model conditioned on textual input to create data points representing the explanations. By comparing the neuron's response to these generated data points and control data points, we can estimate the quality of the explanation. We validate our framework through sanity checks and benchmark various neuron description methods for Computer Vision tasks, revealing significant differences in quality.
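The scoring step described above can be illustrated with a minimal sketch in plain Python. It assumes that the neuron's activations on the control images and on the generated images have already been collected as two lists of numbers. The function names, the toy values, and the exact formulas are illustrative assumptions rather than the authors' implementation: AUC is computed here as the fraction of (generated, control) pairs in which the generated image activates the neuron more strongly, and MAD is taken to be the difference in mean activation divided by the standard deviation of the control activations.

```python
from statistics import mean, pstdev


def auc_score(control, synthetic):
    """Probability that a synthetic activation exceeds a control activation.

    Equivalent to the area under the ROC curve when activations are used
    to separate synthetic from control images. Ties count as 0.5.
    """
    wins = 0.0
    for s in synthetic:
        for c in control:
            if s > c:
                wins += 1.0
            elif s == c:
                wins += 0.5
    return wins / (len(synthetic) * len(control))


def mad_score(control, synthetic):
    """Difference in mean activation, in units of the control spread.

    Assumed normalization: population standard deviation of the control
    activations (an illustrative choice, not confirmed by the text above).
    """
    spread = pstdev(control)
    if spread == 0:
        raise ValueError("control activations have zero variance")
    return (mean(synthetic) - mean(control)) / spread


# Hypothetical activations of one neuron (toy values, not real results).
control = [0.1, 0.2, 0.3, 0.4]      # activations on control images
synthetic = [0.5, 0.6, 0.35, 0.9]   # activations on generated images

auc = auc_score(control, synthetic)
mad = mad_score(control, synthetic)
print(f"AUC = {auc:.4f}, MAD = {mad:.4f}")
```

Under these assumptions, an AUC near 0.5 and a MAD near 0 would indicate that the generated images activate the neuron no more than control images do, which suggests a poor explanation. Higher values suggest the explanation captures what the neuron responds to.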
