A Geometric Unification of Concept Learning with Concept Cones

Alexandre Rocchi--Henry, Thomas Fel, Gianni Franchi

TL;DR

This work unifies supervised and unsupervised concept learning by casting both CBMs and SAEs as dictionary-learning problems that generate a nonnegative cone of concept directions in activation space. By defining concept cones and a containment framework, the authors introduce quantitative metrics to evaluate how well SAE-derived dictionaries align with human-aligned CBM concepts, enabling principled assessment of inductive biases such as sparsity and expansion. Empirical results show that certain SAE variants (BatchTopK, Archetypal) and intermediate sparsity/expansion settings best approximate CBM concepts, and that deeper network layers yield stronger semantic alignment with CBMs. The paper thus provides actionable, geometry-grounded guidance for harmonizing supervision and discovery to obtain scalable, interpretable representations in large models.

Abstract

Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases -- such as SAE type, sparsity, or expansion ratio -- to the emergence of plausible\footnote{We adopt the terminology of \citet{jacovi2020towards}, who distinguish between faithful explanations (accurately reflecting model computations) and plausible explanations (aligning with human intuition and domain knowledge). CBM concepts are plausible by construction -- selected or annotated by humans -- though not necessarily faithful to the true latent factors that organise the data manifold.} concepts. Using these metrics, we uncover a ``sweet spot'' in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concepts align with plausible human concepts.

Paper Structure

This paper contains 62 sections, 28 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration of the concept learning spaces. Each image $\bm{x} \in \mathcal{X}$ comes from hidden factors $\bm{c} \in \mathcal{C}$ through a generative process $\bm{G}:\mathcal{C}\!\to\!\mathcal{X}$. A neural network $\bm{f}:\mathcal{X}\!\to\!\mathcal{A}$ maps inputs to activations $\bm{a}=\bm{f}(\bm{x})$. Concept extraction aims to find a function $\bm{H}:\mathcal{A}\!\to\!\mathcal{C}$ that recovers these hidden factors. Here, $\mathcal{C}$ is the concept space with interpretable axes (e.g., rabbit, tree), $\mathcal{X}$ is the input space of images, and $\mathcal{A}$ is the activation space where concepts can be linearly decoded.
  • Figure 2: CBMs as anchors for unsupervised concept discovery. Both supervised and unsupervised methods learn linear directions in activation space $\mathcal{A}$, but they do so under different objectives. Unsupervised approaches such as SAEs decompose activations $\bm{A}$ into a dictionary $\bm{D}$ and sparse codes $\bm{Z}$, uncovering emergent concept directions without human supervision. CBMs, in contrast, learn a set of concept directions $\mathbf{W}_c$ aligned with annotated or language-derived concepts that are known to be meaningful to humans. This contrast raises a central question: do the directions discovered by an SAE contain, or at least sparsely approximate, the human-desirable directions encoded by a CBM? Geometrically, this amounts to testing whether the supervised concept cone $\mathcal{C}_{\mathbf{W}_c}=\{\mathbf{W}_c^\top \bm{\beta}\!:\!\bm{\beta}\!\ge\!0\}$ is included in the unsupervised cone $\mathcal{C}_{\bm{D}}=\{\bm{D}^\top \bm{\alpha}\!:\!\bm{\alpha}\!\ge\!0\}$. If so, the SAE has discovered a basis rich enough to represent human-aligned semantics through simple nonnegative combinations of its atoms. This viewpoint reframes CBMs not as competitors to SAEs but as geometric reference systems: they define desirable regions of concept space whose containment within the SAE cone quantifies the extent to which unsupervised discovery finds plausible concepts \citep{jacovi2020towards}.
  • Figure 3: Radar plot illustrating the influence of (a) the sparsity target, (b) the expansion factor, and (c) the feature level on TopK Sparse Autoencoder (SAE) performance for both the CUB and ImageNet datasets. Each axis corresponds to a normalized metric.
  • Figure A.4: Cosine similarities between selected CBM concepts and all SAE dictionary atoms, visualized through UMAP. Strong responses concentrate around a limited number of atoms, suggesting that only a small fraction of SAE directions encode concepts shared with the CBM.
  • Figure A.5: Class-wise concept histograms for Husky and Wolf images. Each sample is assigned the concept with maximum correlation to its SAE representation, revealing clear concept-frequency biases between the two classes.
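
The cone-containment question raised in the Figure 2 caption can be illustrated with a small numerical check. The sketch below is not the paper's metric: it assumes a toy dictionary `D` whose rows are atoms (matching the $\bm{D}^\top \bm{\alpha}$ convention above) and uses a plain projected-gradient solver for nonnegative least squares, chosen here only for simplicity. The function names `nnls_projected_gradient` and `cone_residual` and all numerical values are hypothetical. A relative residual near zero suggests that a CBM direction lies inside the SAE cone; a large residual suggests it does not.

```python
import numpy as np


def nnls_projected_gradient(D, w, n_iter=2000):
    """Approximately solve min_{alpha >= 0} 0.5 * ||D^T alpha - w||^2.

    D: (k, d) array whose rows are dictionary atoms (illustrative assumption).
    w: (d,) direction to approximate, e.g. one CBM concept direction.
    """
    gram = D @ D.T                      # (k, k) Gram matrix of the atoms
    Dw = D @ w                          # (k,) correlations between atoms and w
    # Step size 1/L, where L is the largest eigenvalue of the Gram matrix.
    step = 1.0 / (np.linalg.norm(gram, 2) + 1e-12)
    alpha = np.zeros(D.shape[0])
    for _ in range(n_iter):
        grad = gram @ alpha - Dw        # gradient of the least-squares objective
        alpha = np.maximum(alpha - step * grad, 0.0)  # project onto alpha >= 0
    return alpha


def cone_residual(D, w):
    """Relative distance from w to the cone {D^T alpha : alpha >= 0}."""
    alpha = nnls_projected_gradient(D, w)
    residual = np.linalg.norm(D.T @ alpha - w) / np.linalg.norm(w)
    return residual, alpha


# Toy example: two atoms spanning the nonnegative quadrant of the plane.
D = np.array([[1.0, 0.0],
              [0.0, 1.0]])
w_inside = np.array([1.0, 2.0])     # lies inside the cone
w_outside = np.array([1.0, -1.0])   # has a negative component, so it lies outside

res_inside, alpha_inside = cone_residual(D, w_inside)
res_outside, alpha_outside = cone_residual(D, w_outside)

print(f"inside:  residual={res_inside:.4f}, alpha={alpha_inside}")
print(f"outside: residual={res_outside:.4f}, alpha={alpha_outside}")
# The inside residual is approximately 0; the outside residual is
# approximately 0.7071 (= 1/sqrt(2)), since the closest cone point is (1, 0).
```

Applying `cone_residual` to every row of a CBM weight matrix and averaging the residuals would give one crude containment score under these assumptions. In practice an exact NNLS solver would likely be preferable to this fixed-iteration loop.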

Theorems & Definitions (3)

  • Definition 1: Concept as inversion of latent factors
  • Definition 2: Concept Cone
  • proof