Learning Encoding-Decoding Direction Pairs to Unveil Concepts of Influence in Deep Vision Networks

Alexandros Doumanoglou, Kurt Driessens, Dimitrios Zarpalas

TL;DR

This work tackles the opacity of deep vision models by proposing an unsupervised method to uncover encoding-decoding direction pairs that embed and read concepts in latent space. Grounded in a linear representation hypothesis, it jointly learns concept encoding directions (signal vectors) and decoding directions (filters) through a multi-concept data model and uncertainty-driven losses, notably Uncertainty Region Alignment, optimized with an augmented Lagrangian scheme. The approach yields interpretable, monosemantic concept detectors, enables both global and local explanations, and supports interventions and counterfactuals, demonstrated on synthetic data and multiple real-world CNNs. Across architectures and datasets, EDDP achieves strong clustering quality, interpretable concept detectors, and meaningful influence mappings, while revealing a trade-off between interpretability and network influence. The framework offers a path toward deeper mechanistic understanding and practical model debugging without requiring annotations or altering model training.

Abstract

Empirical evidence shows that deep vision networks often represent concepts as directions in latent space with concept information written along directional components in the vector representation of the input. However, the mechanism to encode (write) and decode (read) concept information to and from vector representations is not directly accessible as it constitutes a latent mechanism that naturally emerges from the training process of the network. Recovering this mechanism unlocks significant potential to open the black-box nature of deep networks, enabling understanding, debugging, and improving deep learning models. In this work, we propose an unsupervised method to recover this mechanism. For each concept, we explain that under the hypothesis of linear concept representations, this mechanism can be implemented with the help of two directions: the first facilitating encoding of concept information and the second facilitating decoding. Unlike prior matrix decomposition, autoencoder, or dictionary learning methods that rely on feature reconstruction, we propose a new perspective: decoding directions are identified via directional clustering of activations, and encoding directions are estimated with signal vectors under a probabilistic view. We further leverage network weights through a novel technique, Uncertainty Region Alignment, which reveals interpretable directions affecting predictions. Our analysis shows that (a) on synthetic data, our method recovers ground-truth direction pairs; (b) on real data, decoding directions map to monosemantic, interpretable concepts and outperform unsupervised baselines; and (c) signal vectors faithfully estimate encoding directions, validated via activation maximization. Finally, we demonstrate applications in understanding global model behavior, explaining individual predictions, and intervening to produce counterfactuals or correct errors.
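To make the distinction between encoding and decoding directions concrete, the following is a minimal NumPy sketch of the linear representation hypothesis described above. It is an illustration only, not the paper's learning procedure: the signal directions are random, the latent dimensionality and the names `S`, `W`, `alpha` are assumptions, the latent-space bias is omitted, and the filters are obtained here with a pseudo-inverse rather than by directional clustering.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # latent dimensionality (illustrative assumption)

# Hypothetical encoding (signal) directions for two concepts, one per column.
# They are random unit vectors, so they are generally not orthogonal.
S = rng.normal(size=(d, 2))
S /= np.linalg.norm(S, axis=0, keepdims=True)

# Encoding (write): latent factors alpha are written along the signal directions.
alpha = np.array([1.5, -0.7])
x = S @ alpha

# Decoding (read): filters W chosen so that w_i . s_j = 1 if i == j, else 0.
# The rows of the pseudo-inverse have this property when the columns of S
# are linearly independent. This is a stand-in, not the paper's estimator.
W = np.linalg.pinv(S)  # shape (2, d)
alpha_hat = W @ x

# Reading with the signal directions themselves mixes the two concepts
# whenever the signal directions are not orthogonal.
alpha_naive = S.T @ x

print("true factors:       ", alpha)
print("decoded with filters:", alpha_hat)
print("decoded with signals:", alpha_naive)
```

In this toy setting the filters recover the latent factors exactly, while projecting onto the signal directions does not. This is one way to see why a concept may need a pair of directions rather than a single one.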

Paper Structure

This paper contains 80 sections, 43 equations, 52 figures, 30 tables, 1 algorithm.

Figures (52)

  • Figure 1: Concept Encoding-Decoding under the Linear Representation Hypothesis: Deep networks encode high-level concepts, such as sky or boat, in distinct directions of their latent space, respectively ${\bm{s}}_1$ and ${\bm{s}}_2$. The illustration shows the encoding of two concept latent factors (i.e., the degree of concept presence) $\alpha^1, \alpha^2$ within a patch's representation ${\bm{x}}_{{\bm{p}}}$ by utilizing the concept embedding directions ${\bm{s}}$. Additionally, it demonstrates how a filter ${\bm{w}}_1$ can be employed to extract one of these latent factors from the representation. The illustration omits the latent space bias for brevity. We use the terms {encoding direction, concept embedding, signal direction} and the terms {filter, decoding direction} interchangeably throughout this article.
  • Figure 2: Our Encoding-Decoding Direction Pairs (EDDP) method powers a range of applications, highlighting both the generality and the precision of the approach. The figure summarizes the applications that we selected to discuss in this work.
  • Figure 3: The core concept of Unsupervised Interpretable Basis Extraction (UIBE) is to learn a set of concept detectors, which are essentially binary linear classifiers with learnable filters and biases. These detectors aim to transform feature representations to the soft-binary vector space of concepts, in which the newly transformed representations are sparse. In this procedure, the input to the method consists of image representations coming from an unlabeled concept dataset. Identifying the concept name behind each detector is done in a post-processing step with a procedure we refer to as Direction Labeling.
  • Figure 4: The proposed method analyzes the latent space to uncover its directional structure. Because many concepts are naturally encoded as specific directions, this process often reveals the encoding-decoding mechanism of meaningful, monosemantic, and highly interpretable concepts. The figure depicts an overview of the method's components. $\mathcal{L}$ denotes loss terms. Purple indicates contributions of this work, while light gray indicates loss terms from UIBE/CBE.
  • Figure 5: Left: The learnable parameters of the method $\hat{{\bm{S}}},{\bm{W}},{\bm{b}}$ and intermediate variables $\mathbf{z}_{{\bm{p}}}, {\bm{y}}_{{\bm{p}}}, {\bm{q}}_{{\bm{p}}}$. Top Right: Feature manipulation and Uncertainty Region Alignment. Bottom Right: Loss terms $\mathcal{L}$ with their dependencies. Purple indicates loss contributions of this work, while light gray indicates loss terms from UIBE/CBE.
  • ...and 47 more figures