From Neural Activations to Concepts: A Survey on Explaining Concepts in Neural Networks

Jae Hee Lee; Sergio Lanza; Stefan Wermter

TL;DR

Neural networks are powerful but often opaque. This survey reviews methods for explaining the concepts that networks learn, which in turn helps bridge learning and reasoning. The paper groups explanations into neuron-level and layer-level approaches and covers similarity-based dissection, causal analyses, Concept Activation Vectors (CAVs), probing, and Concept Bottleneck Models (CBMs), with examples such as MILAN and CLIP-based mappings. It highlights how these methods support neuro-symbolic AI by exposing or injecting concepts, which enables grounded explanations, debugging, and potentially symbolic reasoning. Overall, the survey maps a rapidly evolving landscape toward more transparent and controllable AI systems, and it emphasizes the need for empirical comparisons and integration across concept-explanation techniques.

Abstract

In this paper, we review recent approaches for explaining concepts in neural networks. Concepts can act as a natural link between learning and reasoning: once the concepts that a neural learning system uses are identified, one can integrate those concepts with a reasoning system for inference, or use a reasoning system to act upon them to improve or enhance the learning system. Conversely, knowledge can not only be extracted from neural networks; concept knowledge can also be inserted into neural network architectures. Since integrating learning and reasoning is at the core of neuro-symbolic AI, the insights gained from this survey can serve as an important step towards realizing neuro-symbolic AI based on explainable concepts.

Paper Structure (10 sections, 5 figures)

This paper contains 10 sections, 5 figures.

Introduction
Neuron-Level Explanations
    Using Similarities between Concepts and Activations
    Using Causal Relationships between Concepts and Activations
Layer-Level Explanations
    Using Vectors to Explain Concepts: Concept Activation Vectors
    Using Classifiers to Explain Concepts: Probing
    Using Localist Representations: Concept Bottleneck Models
Conclusion
Acknowledgement

Figures (5)

Figure 1: Neuron-level explanation using similarities between concepts and activations. Depicted is the network dissection approach, which compares the segmented concept in the input with the activation mask of a neuron (Bau et al., 2017).
Figure 2: Neuron-level explanation using causal relationships between concepts and activations. In causal mediation analysis, the activation of a neuron is replaced with the value the neuron would have output if there were an intervention on the input (here, the subject of the input sentence is changed from singular to plural). Afterward, the change in the model's prediction of the correct verb conjugation with and without the intervention is measured (Finlayson et al., 2021).
Figure 3: Layer-level explanation using vectors to explain concepts. For each concept $C$, positive examples $x_{C}^{+}$ and negative examples $x_{C}^{-}$ are fed to a pre-trained model to learn the so-called concept activation vector (CAV) $v_{C}$ from the corresponding activations of the target layer (Kim et al., 2018).
Figure 4: Layer-level explanation using a classifier to explain concepts. In this example, a pre-trained model takes a sentence as its input, and a probing classifier is applied to the activation of the highlighted layer to check whether the activation encodes the concept of sentence length (Adi et al., 2016).
Figure 5: Layer-level explanation using the concept bottleneck model approach (Koh et al., 2020). Each neuron in the concept bottleneck $f^{\ell}$ corresponds to a unique concept (e.g., wing color).
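
The comparisons described in the captions of Figures 1 and 3 can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the surveyed authors' implementations. The function names `dissection_iou` and `fit_cav` are hypothetical. The first function assumes the activation map already has the same resolution as the concept mask and that a threshold is given. The second assumes the CAV is taken as the unit normal of a linear (here, logistic regression) decision boundary separating positive from negative activations.

```python
import math


def dissection_iou(activation_map, concept_mask, threshold):
    """Intersection over union between a thresholded activation map
    and a binary concept segmentation mask (both 2D nested lists)."""
    inter = 0
    union = 0
    for act_row, mask_row in zip(activation_map, concept_mask):
        for act, in_concept in zip(act_row, mask_row):
            active = act > threshold          # neuron's activation mask
            labeled = bool(in_concept)        # concept's segmentation mask
            inter += active and labeled
            union += active or labeled
    return inter / union if union else 0.0


def fit_cav(pos_acts, neg_acts, lr=0.1, steps=500):
    """Fit logistic regression on layer activations by gradient descent
    and return the unit-length weight vector as the concept direction."""
    data = [(x, 1.0) for x in pos_acts] + [(x, 0.0) for x in neg_acts]
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y   # sigmoid(z) - label
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


# Toy usage with made-up numbers (not data from the survey).
activation = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
concept = [[0, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
iou = dissection_iou(activation, concept, threshold=0.5)

pos = [[2.0, 0.5], [2.0, -0.5], [3.0, 0.5], [3.0, -0.5]]
neg = [[-2.0, -0.5], [-2.0, 0.5], [-3.0, -0.5], [-3.0, 0.5]]
cav = fit_cav(pos, neg)
```

In this toy setup, a neuron whose thresholded activation overlaps strongly with a concept's segmentation would receive a high IoU, and the fitted vector points from the negative examples toward the positive ones in activation space. A real pipeline would read activations from a chosen layer of a pre-trained model instead of using hand-written lists.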
