Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, Martin Wattenberg

TL;DR

The paper investigates DINOv2 representations through the Linear Representation Hypothesis by constructing a 32k-atom dictionary via stable sparse autoencoders. It reveals task-specific subspaces, with classification relying on "Elsewhere" concepts, segmentation on border detectors, and depth estimation on monocular cues, while uncovering a partly dense, anisotropic geometry and a low-dimensional per-image token organization. It then introduces the Minkowski Representation Hypothesis (MRH), which posits that tokens arise as sums of convex mixtures of archetypes: multi-head attention yields Minkowski sums of per-head polytopes, a structure the paper links to Gärdenfors’ conceptual spaces. The work provides theoretical arguments and preliminary empirical signals that MRH can explain interpolative, region-based activation patterns. For interpretability, it suggests steering toward archetypes and notes that Minkowski decompositions are not unique. Overall, the study reframes concept representations from linear directions to convex regions assembled from archetypal landmarks, offering a geometry-grounded lens for interpreting vision transformer representations, together with practical visualization tools such as DinoVision.

Abstract

DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits "Elsewhere" concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We find that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low-dimensional, locally connected set that persists after positional information is removed. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gärdenfors' conceptual spaces and in the model's mechanism, as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.

Paper Structure

This paper contains 60 sections, 10 theorems, 36 equations, 28 figures.

Key Result

Lemma 1

Let one attention head have queries $\bm{Q}$, keys $\bm{K}$, values $\bm{V}=\{\bm{v}_1,\ldots,\bm{v}_m\}$, attention $\bm{A}=\bm{\sigma}(\bm{Q}\bm{K}^\top)$, and outputs $\bm{Y}=\bm{A}\bm{V}$ with attainable set $\mathcal{Y}$. Then $\mathcal{Y}\subseteq \operatorname{conv}(\bm{V})$. Moreover, every
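The containment claim in Lemma 1 can be checked numerically. The following is a minimal sketch, assuming that $\bm{\sigma}$ is a row-wise softmax and using small random matrices; the names `softmax` and `single_head` and the sizes `n`, `m`, `d` are illustrative choices, not taken from the paper. Each row of $\bm{A}$ is non-negative and sums to one, so each output row is a convex combination of the value vectors.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head(Q, K, V):
    # A = softmax(Q K^T) row-wise, Y = A V, following the lemma's notation.
    A = softmax(Q @ K.T, axis=-1)
    return A, A @ V

rng = np.random.default_rng(0)
n, m, d = 5, 4, 3                      # hypothetical sizes: queries, values, width
Q = rng.normal(size=(n, d))
K = rng.normal(size=(m, d))
V = rng.normal(size=(m, d))
A, Y = single_head(Q, K, V)

# Rows of A are convex weights: non-negative and summing to one.
rows_are_convex_weights = bool(np.all(A >= 0) and np.allclose(A.sum(axis=1), 1.0))

# A necessary consequence of lying in conv(V): every output stays inside the
# coordinate-wise bounding box of the value vectors.
inside_bbox = bool(
    np.all(Y >= V.min(axis=0) - 1e-12) and np.all(Y <= V.max(axis=0) + 1e-12)
)
print(rows_are_convex_weights, inside_bbox)
```

The bounding-box test is only a necessary condition for hull membership; the convex-weight property of the rows of `A` is what establishes containment.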

Figures (28)

  • Figure 1: Overview of our study. Part I: Downstream usage. Different tasks recruit distinctive families of concepts: classification relies on "Elsewhere" detectors, segmentation on boundary concepts, depth estimation on three families of monocular cues, while token-specific concepts (e.g., registers) capture global scene factors such as illumination or motion blur. Part II: Geometry and statistics of concepts. Even though atoms are distributed as in the sparse-coding view, we also find anisotropy aligned with task subspaces, antipodal pairs forming signed axes, and partly dense structure: positional information compresses into 2D, yet locally connected neighborhoods persist even after removing position. Together, these signs suggest that representations are organized beyond linear sparsity alone. Part III: Towards Minkowski Geometry. Synthesizing these observations, we explore a refined view: tokens as sums of convex mixtures. This view is grounded in the cognitive theory of Gärdenfors’ conceptual spaces as well as in the model’s own mechanism: each attention head produces convex combinations of value vectors, and their outputs add across heads; tokens can thus be understood as convex mixtures of a few archetypal landmarks (e.g., a rabbit among animals, brown among colors, fluffy among textures). This points to activations being organized as Minkowski sums of convex polytopes, with concepts arising as convex regions rather than linear directions. We finish by examining empirical signals of this geometry and its consequences for interpretability.
  • Figure 2: Concept importance across tasks. UMAP projection of the learned dictionary, with colors indicating the relative magnitude of each concept’s contribution to three downstream tasks: (Left) classification (ImageNet-1k), (Middle) segmentation, (Right) depth estimation. While classification recruits a broad set of concepts, segmentation and depth primarily activate a more restricted set of concepts. Although UMAP only preserves local geometry, functionally relevant groupings are visibly clustered in the projection. We show in later sections that different tasks consistently recruit distinct, low-dimensional regions of the concept space.
  • Figure 3: (Left) Classification recruits more concepts than segmentation, which in turn recruits more than depth. Classification utilizes a larger fraction of the dictionary compared to segmentation and depth, likely reflecting the higher rank of the classification head. This supports the view that task complexity and output dimensionality shape the breadth of concept recruitment. (Middle) Intra-task concept similarity. Cosine similarity histograms of the top-100 most important concepts per task, compared to random subsets of the dictionary. Intra-task concept pairs exhibit higher mutual alignment, deviating from the quasi-orthogonality expected of generic dictionary atoms. This suggests that functional concepts form more coherent subspaces. (Right) Spectral analysis of task-specific subspaces. Singular value spectra of the top-100 task-relevant concepts reveal sharply decaying profiles for all tasks (especially segmentation and depth), indicating that each task activates a low-dimensional functional subspace. Compared to random concept subsets, task-aligned subspaces exhibit stronger concentration, supporting a "functional subspace" hypothesis.
  • Figure 4: "Elsewhere" concepts reflect off-object activation conditioned on object presence. Visualization of a recurring concept pattern, consistently among the top-3 most important concepts for several ImageNet classes (rows: rabbit, fox, cat), using token-level attribution (middle row) and causal masking (RISE; Petsiuk et al., 2018) (bottom row). These "Elsewhere" concepts consistently activate in tokens disjoint from the object, yet their presence is conditional on the object itself being present elsewhere in the image: they vanish when the object is removed. Rather than capturing background texture, they express a structured logical relation: "not the object, but the object exists". This suggests that DINO implicitly learns a form of fuzzy spatial logic, distributing class evidence across both object-centric and off-object tokens. See the appendix on "Elsewhere" concepts for more details.
  • Figure 5: Segmentation relies on spatially localized border concepts. Examples of the most important concepts across segmentation tasks, visualized via token attribution (colored overlays). Most of these concepts activate along object boundaries, whether biological (e.g., limbs, heads) or architectural (e.g., domes, rooflines). Despite differences in content, these border concepts exhibit consistent spatial patterns and nontrivial similarity in embedding space (right), suggesting a shared functional role and a possibly low-dimensional submanifold within the concept geometry.
  • ...and 23 more figures
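
The coherence and spectral diagnostics described in the Figure 3 caption can be illustrated on synthetic data. This is a rough sketch, not the paper's pipeline: the atoms below are randomly generated stand-ins for dictionary atoms, and the helper names (`unit_rows`, `mean_abs_offdiag_cosine`, `top_singular_energy`) and summary statistics are assumptions chosen for illustration. A set of atoms confined to a low-dimensional subspace shows higher pairwise cosine similarity and a more concentrated singular value spectrum than a random set.

```python
import numpy as np

def unit_rows(D):
    # Normalize each atom (row) to unit Euclidean norm.
    return D / np.linalg.norm(D, axis=1, keepdims=True)

def mean_abs_offdiag_cosine(atoms):
    # Mean absolute cosine similarity over all distinct pairs of atoms.
    U = unit_rows(atoms)
    G = U @ U.T
    off_diagonal = G[~np.eye(G.shape[0], dtype=bool)]
    return float(np.abs(off_diagonal).mean())

def top_singular_energy(atoms, r):
    # Fraction of squared singular values captured by the r largest ones.
    s = np.linalg.svd(unit_rows(atoms), compute_uv=False)
    return float((s[:r] ** 2).sum() / (s ** 2).sum())

rng = np.random.default_rng(0)
d, k = 64, 100                                   # hypothetical width and subset size
random_atoms = rng.normal(size=(k, d))           # stand-in for a random subset
basis = rng.normal(size=(5, d))                  # synthetic 5-dimensional subspace
coherent_atoms = rng.normal(size=(k, 5)) @ basis + 0.05 * rng.normal(size=(k, d))

cos_random = mean_abs_offdiag_cosine(random_atoms)
cos_coherent = mean_abs_offdiag_cosine(coherent_atoms)
energy_random = top_singular_energy(random_atoms, r=5)
energy_coherent = top_singular_energy(coherent_atoms, r=5)
print(cos_random, cos_coherent, energy_random, energy_coherent)
```

With real data, `coherent_atoms` would be replaced by the top-ranked atoms for a task and `random_atoms` by a random subset of the same dictionary.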

Theorems & Definitions (18)

  • Definition 1
  • Definition 2
  • Lemma 1: Single head yields a convex polytope and matches MRH for $|S|=1$
  • Lemma 2: Affine transformations preserve MRH structure
  • Proposition 1: Multi-head attention realizes MRH
  • Proposition 2: Non-identifiability of Minkowski decomposition
  • Definition 3
  • Lemma 3: Single head creates convex polytopes
  • proof
  • Lemma 4: Affine transformations preserve MRH structure
  • ...and 8 more
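
As a reading aid, the step from the single-head lemma to the multi-head proposition can be sketched as follows. The notation here is assumed for illustration and may differ from the paper's: $H$ heads, value sets $\bm{V}^{(h)}$, per-head outputs $\bm{y}^{(h)}$, and output-projection blocks $\bm{W}_O^{(h)}$. Each head's output lies in the convex hull of its values, linear maps send convex hulls to convex hulls, and the head outputs add, so the combined output lies in a Minkowski sum of polytopes:

$$
\bm{y} \;=\; \sum_{h=1}^{H} \bm{W}_O^{(h)} \bm{y}^{(h)}, \qquad \bm{y}^{(h)} \in \operatorname{conv}\big(\bm{V}^{(h)}\big)
\;\;\Longrightarrow\;\;
\bm{y} \;\in\; \bigoplus_{h=1}^{H} \operatorname{conv}\big(\bm{W}_O^{(h)} \bm{V}^{(h)}\big),
$$

where $\mathcal{A} \oplus \mathcal{B} = \{\bm{a} + \bm{b} : \bm{a} \in \mathcal{A},\ \bm{b} \in \mathcal{B}\}$ denotes the Minkowski sum and $\bm{W}_O^{(h)} \bm{V}^{(h)}$ denotes the set of projected value vectors.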