Concept Boundary Vectors

Thomas Walker

TL;DR

Problem: interpret the latent representations of concepts that neural networks learn despite being trained with simple objectives. Approach: introduce concept boundary vectors (CBVs), which are constructed from the boundary normals ${\cal N}_{\pm}$ between concept representations and optimized to align with those directions, in contrast to concept activation vectors (CAVs). Contributions: empirical evidence that, compared with CAVs, CBVs better capture relationships between concepts, have a higher logit influence $LI$ on target classes, and exhibit stronger topological coherence under persistent homology and Mapper analyses. Significance: the link between boundary geometry and latent space homogeneity, measured via Euclidicity, supports CBVs as a more faithful interpretability tool for real-world models.

Abstract

Machine learning models are trained with relatively simple objectives, such as next token prediction. However, on deployment, they appear to capture a more fundamental representation of their input data. It is of interest to understand the nature of these representations to help interpret the model's outputs and to identify ways to improve the salience of these representations. Concept vectors are constructions aimed at attributing concepts in the input data to directions, represented by vectors, in the model's latent space. In this work, we introduce concept boundary vectors as a concept vector construction derived from the boundary between the latent representations of concepts. Empirically we demonstrate that concept boundary vectors capture a concept's semantic meaning, and we compare their effectiveness against concept activation vectors.
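
As a rough illustration of the idea of deriving a concept vector from the boundary between two sets of latent representations, the sketch below pairs each latent point with its nearest neighbour in the other concept, normalises the differences into boundary normals, and returns the unit vector with the highest mean cosine similarity to those normals. This is a minimal sketch under stated assumptions, not the paper's algorithm: the nearest-neighbour pairing rule, the closed-form solution (the normalised sum of unit normals, which maximises mean cosine similarity over unit vectors), and the function names `boundary_normals` and `concept_boundary_vector` are all hypothetical choices made for illustration.

```python
import numpy as np


def boundary_normals(source, target):
    """Unit vectors pointing from source-side points to target-side points.

    Assumption (not from the paper): boundary pairs are formed by matching
    each point with its nearest neighbour in the other concept. The two
    point clouds are assumed to share no identical points.
    """
    # Pairwise Euclidean distances, shape (n_source, n_target).
    dists = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=-1)
    nearest_target = target[dists.argmin(axis=1)]  # one per source point
    nearest_source = source[dists.argmin(axis=0)]  # one per target point
    # All differences are oriented from the source concept to the target.
    diffs = np.concatenate([nearest_target - source, target - nearest_source])
    return diffs / np.linalg.norm(diffs, axis=1, keepdims=True)


def concept_boundary_vector(source, target):
    """Unit vector with maximal mean cosine similarity to the normals.

    For unit normals this maximiser is their normalised sum, so no
    iterative optimisation is needed in this simplified setting.
    """
    total = boundary_normals(source, target).sum(axis=0)
    return total / np.linalg.norm(total)
```

For two small point clouds separated along the first axis, for example `source = np.array([[-1.0, 0.0], [-1.0, 1.0], [-1.0, -1.0]])` and `target = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, -1.0]])`, every boundary normal is `(1, 0)`, so the resulting vector points along the first axis from the source concept to the target concept.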

Paper Structure

This paper contains 33 sections, 7 equations, 21 figures, 1 table, 1 algorithm.

Figures (21)

  • Figure 1: Figure \ref{fig:2d_concept_vector_loss_variability} shows the variability in the loss as a concept vector trained on two-dimensional data is rotated. Figure \ref{fig:3d_concept_vector_loss_variability} shows the variability in the loss as a concept vector trained on three-dimensional data is rotated.
  • Figure 2: Figure \ref{fig:logit_influence_on_target} shows the influence of the concept vectors on the logit of the target concept. Figure \ref{fig:logit_influence_on_source} shows the influence of the concept vectors on the logit of the source concept. The dashed line marks where the influence of the vectors is equal.
  • Figure 3: The cosine similarities between concept vectors with the concept $\mathsf{0}$ as their target.
  • Figure 4: The persistence diagrams obtained from geodesic-based filtrations of concept activation vectors and concept boundary vectors.
  • Figure 5: Mapper plots obtained from concept activation vectors and concept boundary vectors. The size of each node represents the number of concept vectors in the cluster it represents. The colour of each node encodes the average cosine similarity between the concept vectors within that cluster.
  • ...and 16 more figures