Concept Boundary Vectors
Thomas Walker
TL;DR
Problem: interpreting the latent representations of concepts that neural networks learn when trained on simple objectives. Approach: introduce concept boundary vectors (CBVs), which are constructed from the boundary normals ${\cal N}_{\pm}$ between concept representations and optimized to align with those directions, and contrast them with concept activation vectors (CAVs). Contributions: empirical evidence that, compared with CAVs, CBVs better capture relationships between concepts, have higher logit influence $LI$ on target classes, and exhibit stronger topological coherence under persistent homology and Mapper analyses. Significance: links between boundary geometry and the homogeneity of the latent space, measured via Euclidicity, support CBVs as a more faithful interpretability tool for real-world models.
Abstract
Machine learning models are trained with relatively simple objectives, such as next-token prediction. Once deployed, however, they appear to capture a more fundamental representation of their input data. Understanding the nature of these representations is of interest both for interpreting the model's outputs and for identifying ways to improve the salience of the representations. Concept vectors are constructions that attribute concepts in the input data to directions, represented by vectors, in the model's latent space. In this work, we introduce concept boundary vectors, a concept vector construction derived from the boundary between the latent representations of concepts. Empirically, we demonstrate that concept boundary vectors capture a concept's semantic meaning, and we compare their effectiveness against concept activation vectors.
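To make the distinction between the two kinds of concept vector concrete, the following is a minimal sketch under stated assumptions, not the construction used in this work. It assumes that boundary normals can be approximated by unit vectors joining nearest neighbours across two sets of latent activations, and that "aligning" a vector with those normals means maximizing mean cosine similarity, whose optimum is the normalized mean of the normals. The mean-difference vector is a simple stand-in for a CAV-style direction (CAVs are commonly taken as the normal of a linear classifier). All function names and the synthetic data are hypothetical.

```python
import numpy as np

def unit(v, eps=1e-12):
    """Scale a vector (or each row of a matrix) to unit Euclidean length."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def mean_difference_vector(pos, neg):
    """CAV-style stand-in (assumption): direction between the class means."""
    return unit(pos.mean(axis=0) - neg.mean(axis=0))

def boundary_normals(pos, neg):
    """Approximate boundary normals (assumption) by nearest cross-class pairs.

    Every returned row is a unit vector pointing from a `neg` point
    towards a `pos` point.
    """
    # Pairwise Euclidean distances, shape (n_pos, n_neg).
    dists = np.linalg.norm(pos[:, None, :] - neg[None, :, :], axis=-1)
    from_pos = pos - neg[dists.argmin(axis=1)]   # each pos point to its nearest neg
    from_neg = pos[dists.argmin(axis=0)] - neg   # each neg point to its nearest pos
    return unit(np.vstack([from_pos, from_neg]))

def boundary_vector(normals):
    """Unit vector maximizing mean cosine similarity with the unit normals.

    For this particular objective the optimum is the normalized mean,
    so no iterative optimization is needed in the sketch.
    """
    return unit(normals.mean(axis=0))

# Synthetic "latent activations" for two concepts (illustrative only).
rng = np.random.default_rng(0)
shift = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
pos = rng.normal(size=(50, 5)) + shift
neg = rng.normal(size=(50, 5)) - shift

cav_like = mean_difference_vector(pos, neg)
cbv_like = boundary_vector(boundary_normals(pos, neg))
print("cosine similarity between the two directions:", float(cav_like @ cbv_like))
```

On well-separated, roughly isotropic clusters such as these, the two directions should be close; they can differ more when the boundary between concepts is curved or the classes are anisotropic, since only the boundary-based vector depends on points near the interface.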
