CoSy: Evaluating Textual Explanations of Neurons

Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina M.-C. Höhne, Kirill Bykov

TL;DR

CoSy is the first automatic framework for quantitatively evaluating open-vocabulary textual explanations of neurons. It translates each explanation into synthetic images and compares the neuron's activations on those images against its activations on a control dataset. The method has three steps (generate, measure, and score) and uses two metrics, AUC and MAD, to assess how well an explanation aligns with the neuron's behavior across architectures and datasets. Extensive sanity checks and cross-method benchmarking reveal substantial variability in explanation quality: concepts in higher layers are generally better explained, and INVERT and CLIP-Dissect often outperform MILAN and FALCON. The approach offers a scalable, architecture-agnostic way to benchmark explanations, and the results urge caution when interpreting explanations of lower-layer neurons and of abstract concepts.

Abstract

A crucial aspect of understanding the complex nature of Deep Neural Networks (DNNs) is the ability to explain learned concepts within their latent representations. While methods exist to connect neurons to human-understandable textual descriptions, evaluating the quality of these explanations is challenging due to the lack of a unified quantitative approach. We introduce CoSy (Concept Synthesis), a novel, architecture-agnostic framework for evaluating textual explanations of latent neurons. Given textual explanations, our proposed framework uses a generative model conditioned on textual input to create data points representing the explanations. By comparing the neuron's response to these generated data points and control data points, we can estimate the quality of the explanation. We validate our framework through sanity checks and benchmark various neuron description methods for Computer Vision tasks, revealing significant differences in quality.
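The scoring step described above can be illustrated with a minimal Python sketch. It assumes the neuron's activations on the control images and on the synthetic images have already been collected as 1-D arrays. The function names `auc_score` and `mad_score` are hypothetical, and the MAD formula used here (difference of mean activations, divided by the standard deviation of the control activations) is an assumption for illustration, since this page names the metric without defining it. The AUC is computed as the probability that a synthetic-image activation exceeds a control-image activation.

```python
import numpy as np


def auc_score(control, synthetic):
    """AUC as the fraction of (synthetic, control) pairs in which the
    synthetic activation is larger; ties count as half a win."""
    control = np.asarray(control, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    # Compare every synthetic activation with every control activation.
    greater = (synthetic[:, None] > control[None, :]).sum()
    ties = (synthetic[:, None] == control[None, :]).sum()
    return (greater + 0.5 * ties) / (synthetic.size * control.size)


def mad_score(control, synthetic):
    """Assumed MAD: mean activation difference between synthetic and
    control images, in units of the control standard deviation."""
    control = np.asarray(control, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    return (synthetic.mean() - control.mean()) / control.std()


if __name__ == "__main__":
    # Simulated activations stand in for real model outputs (illustrative only).
    rng = np.random.default_rng(0)
    control_acts = rng.normal(loc=0.0, scale=1.0, size=1000)
    synthetic_acts = rng.normal(loc=2.0, scale=1.0, size=50)
    print("AUC:", auc_score(control_acts, synthetic_acts))
    print("MAD:", mad_score(control_acts, synthetic_acts))
```

Under these assumptions, an AUC near 0.5 and a MAD near 0 would indicate that the neuron does not distinguish images generated from the explanation from control images, while larger values would indicate a better-matching explanation.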

Paper Structure (38 sections, 10 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: A schematic illustration of the CoSy evaluation framework for Neuron 80 in ResNet18’s avgpool layer. The current challenge lies in the absence of general-purpose, quantitative evaluation measures to benchmark textual explanations of neurons. To address this, we propose CoSy, a framework consisting of three steps: first, a generative model translates textual concepts into the visual domain, creating synthetic images for each explanation using a text-to-image model. Then, inference is performed on these synthetic images alongside a control image dataset to collect neuron activations. Finally, by comparing activations from the synthetic images with those from the control dataset, we quantitatively assess the quality of the textual explanation and compare results across different explanation methods. The implementation details of this example can be found in Appendix \ref{sec:graph-details}.
  • Figure 2: An overview of the impact of varying the prompt on the similarity between natural and synthetic images, using two text-to-image models. Left: the average Cosine Similarity (CS) between natural and synthetic images, taken over all classes, is reported. Higher CS values are better, indicating greater similarity between the images. Right: an illustration of the visual differences produced by the SDXL and SC models in response to diverse prompts for the concept "submarine", and natural images from the ImageNet validation dataset (Russakovsky et al., 2015). Our results show that both SDXL and SC generate similar images, with SDXL generally being more closely aligned with natural images than SC.
  • Figure 3: An overview of analyses performed to study the similarity between natural and synthetic images. From left to right: (a) an overview of MAD scores between synthetic and natural image activations of the output neuron's ground truth classes for each model studied in this work, (b) activations collected for neuron $504$ in ResNet18 for the class "coffee mug", showcasing the difference between the natural and synthetic distributions and (c) examples of natural versus synthetic images. In both analyses, we observe a substantial overlap in the activations of synthetic and natural images, suggesting that the models respond similarly to both types of images.
  • Figure 4: A comparison of how different explanation methods vary in their quality, as measured by (a) AUC and (b) MAD, across different layers in ResNet18. INVERT and CLIP-Dissect maintain high AUC and MAD scores across all layers, while MILAN and FALCON have lower scores. Overall, performance declines in the lower layers for all methods.
  • Figure 5: A qualitative example of neuron explanations across four neurons. The first four panels show the textual explanations from INVERT, FALCON, CLIP-Dissect, and MILAN alongside three corresponding generated images. The respective AUC and MAD scores are reported below each panel. The last panel shows the activation distributions across $50$ generated images for each method and the distribution of the control data.
  • ...and 6 more figures