A Geometric Unification of Concept Learning with Concept Cones

Alexandre Rocchi--Henry; Thomas Fel; Gianni Franchi

TL;DR

This work unifies supervised and unsupervised concept learning by casting both CBMs and SAEs as dictionary-learning problems that generate a nonnegative cone of concept directions in activation space. By defining concept cones and a containment framework, the authors introduce quantitative metrics to evaluate how well SAE-derived dictionaries align with human-aligned CBM concepts, enabling principled assessment of inductive biases such as sparsity and expansion. Empirical results show that certain SAE variants (BatchTopK, Archetypal) and intermediate sparsity/expansion settings best approximate CBM concepts, and that deeper network layers yield stronger semantic alignment with CBMs. The paper thus provides actionable, geometry-grounded guidance for harmonizing supervision and discovery to obtain scalable, interpretable representations in large models.

Abstract

Two traditions of interpretability have evolved side by side but have seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases (such as SAE type, sparsity, or expansion ratio) to the emergence of plausible concepts. We adopt the terminology of Jacovi and Goldberg (2020), who distinguish between faithful explanations (accurately reflecting model computations) and plausible explanations (aligning with human intuition and domain knowledge); CBM concepts are plausible by construction, being selected or annotated by humans, though not necessarily faithful to the true latent factors that organise the data manifold. Using these metrics, we uncover a "sweet spot" in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concepts align with plausible human concepts.
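To make the cone picture concrete, here is a minimal sketch of one way to test whether a direction lies in the cone generated by a dictionary. It is an illustration under stated assumptions, not the paper's metric: the names `cone_residual` and `containment`, the tolerance, and the toy vectors are all hypothetical, and the solver is plain projected gradient descent on a nonnegative least-squares problem. A relative residual of 0 means the target is a nonnegative combination of the dictionary directions; a residual of 1 means the cone offers no help in reconstructing it.

```python
import math


def _combine(coeffs, dictionary, dim):
    """Return sum_i coeffs[i] * dictionary[i] as a plain list."""
    return [sum(c * row[j] for c, row in zip(coeffs, dictionary)) for j in range(dim)]


def cone_residual(dictionary, target, steps=500):
    """Relative distance from `target` to the cone of nonnegative
    combinations of the rows of `dictionary` (hypothetical helper).

    Solves min_{c >= 0} ||sum_i c_i d_i - target||^2 by projected
    gradient descent. Assumes a nonzero dictionary and nonzero target.
    """
    dim = len(target)
    # 1 / ||D||_F^2 is a conservative step size that guarantees convergence.
    step = 1.0 / sum(x * x for row in dictionary for x in row)
    coeffs = [0.0] * len(dictionary)
    for _ in range(steps):
        recon = _combine(coeffs, dictionary, dim)
        err = [r - t for r, t in zip(recon, target)]
        # Gradient step, then projection onto the nonnegative orthant.
        coeffs = [
            max(0.0, c - step * sum(e * x for e, x in zip(err, row)))
            for c, row in zip(coeffs, dictionary)
        ]
    recon = _combine(coeffs, dictionary, dim)
    return math.dist(recon, target) / math.hypot(*target)


def containment(sae_dirs, cbm_dirs, tol=1e-3):
    """Fraction of CBM directions lying (up to `tol`) inside the SAE cone.
    An illustrative score, not the metric defined in the paper."""
    inside = [cone_residual(sae_dirs, d) <= tol for d in cbm_dirs]
    return sum(inside) / len(inside)


# Toy example: the SAE cone is the nonnegative quadrant of the x-y plane.
sae_dirs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cbm_dirs = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(containment(sae_dirs, cbm_dirs))  # prints 0.5
```

In the toy example the first CBM direction is a nonnegative mix of the two SAE directions and so lies inside the cone, while the second is orthogonal to both and lies outside it, giving a containment of one half. For real activations a dedicated nonnegative least-squares solver would be a more practical choice than this hand-rolled loop.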
