The Geometry of Categorical and Hierarchical Concepts in Large Language Models

Kiho Park, Yo Joong Choe, Yibo Jiang, Victor Veitch

TL;DR

This work generalizes the linear representation hypothesis from simple binary directions to vector and polytope representations, linking semantic hierarchy to orthogonality in LLM representation spaces. By unifying embedding and unembedding spaces via a causal inner product, it defines vector representations for binary features and polytope representations for categorical concepts, and proves that hierarchical relations manifest as orthogonal subspaces. Empirically, the authors validate the theory on Gemma and LLaMA-3 using WordNet, showing that WordNet hierarchies are linearly represented and that manipulations along feature-vectors affect target concepts without disturbing off-target ones. The findings offer a foundational, geometrically grounded lens for interpreting LLM semantics and point toward hierarchy-aware interpretability tools and future work on internal-layer geometry.

Abstract

The linear representation hypothesis is the informal idea that semantic concepts are encoded as linear directions in the representation spaces of large language models (LLMs). Previous work has shown how to make this notion precise for representing binary concepts that have natural contrasts (e.g., {male, female}) as directions in representation space. However, many natural concepts do not have natural contrasts (e.g., whether the output is about an animal). In this work, we show how to extend the formalization of the linear representation hypothesis to represent features (e.g., is_animal) as vectors. This allows us to immediately formalize the representation of categorical concepts as polytopes in the representation space. Further, we use the formalization to prove a relationship between the hierarchical structure of concepts and the geometry of their representations. We validate these theoretical results on the Gemma and LLaMA-3 large language models, estimating representations for 900+ hierarchically related concepts using data from WordNet.
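To make the polytope idea concrete, here is a minimal sketch, assuming a categorical concept's polytope is the convex hull of the vector representations of its values and that those vectors are affinely independent (so the polytope is a simplex). The vertex vectors, the function names `barycentric_weights` and `in_simplex`, and the 4-dimensional toy space are all hypothetical illustrations, not the authors' code.

```python
import numpy as np

def barycentric_weights(vertices, x):
    """Solve sum_i w_i * vertices[i] = x subject to sum_i w_i = 1 (least squares)."""
    k, _ = vertices.shape
    # Stack the vertex coordinates with a row of ones that enforces sum(w) = 1.
    A = np.vstack([vertices.T, np.ones((1, k))])
    b = np.concatenate([x, [1.0]])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    residual = float(np.linalg.norm(A @ w - b))
    return w, residual

def in_simplex(vertices, x, tol=1e-8):
    """True if x is a convex combination of the (affinely independent) vertices."""
    w, residual = barycentric_weights(vertices, x)
    return bool(residual < tol and np.all(w >= -tol))

# Hypothetical vector representations for three values of one categorical concept.
vertices = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

x_inside = 0.2 * vertices[0] + 0.3 * vertices[1] + 0.5 * vertices[2]
x_outside = 1.5 * vertices[0] - 0.5 * vertices[1]

w_inside, _ = barycentric_weights(vertices, x_inside)
print("weights for inside point:", np.round(w_inside, 3))
print("inside point in simplex: ", in_simplex(vertices, x_inside))
print("outside point in simplex:", in_simplex(vertices, x_outside))
```

Under these assumptions, the barycentric weights can be read as how strongly a point expresses each value of the concept; a negative weight means the point lies outside the polytope.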

Paper Structure (32 sections, 7 theorems, 19 equations, 20 figures, 1 table)

Key Result

Theorem 4

Suppose there exists a linear representation (normalized direction) $\bar{\ell}_W$ of a binary feature $W$ for an attribute $w$. Then, there is a constant $b_w>0$ and a choice of unembedding space origin $\bar{\gamma}_0^w$ in \ref{eq:transformation} such that

$$\bar{\ell}_W^{\top} g(y) = \begin{cases} b_w & \text{if } y \in \mathcal{Y}(w), \\ 0 & \text{if } y \notin \mathcal{Y}(w). \end{cases}$$

Further, if there exist $d$ attributes $\{w_0, \dots, w_{d-1}\}$ such that the linear representations of the binary features for these attributes …

Figures (20)

  • Figure 1: In the representation spaces of LLMs, hierarchically related concepts (such as $\texttt{plant}\Rightarrow\texttt{animal}$ and $\texttt{mammal}\Rightarrow\texttt{bird}$) live in orthogonal subspaces, while categorical concepts are represented as polytopes. The top panel illustrates the structure; the bottom panels show the measured representation structure in the Gemma LLM. See \ref{sec:experiments} and \ref{sec:visualization} for details.
  • Figure 2: Hierarchical semantics are encoded as orthogonality in the representation space (\ref{thm:orthogonality}). The plots show the projection of the unembedding vectors onto 2D subspaces: $\mathrm{span}\{\bar{\ell}_{\texttt{animal}}, \bar{\ell}_{\texttt{mammal}}\}$ (left; \ref{item:left}), $\mathrm{span}\{\bar{\ell}_{\texttt{animal}},\bar{\ell}_{\texttt{bird}} - \bar{\ell}_{\texttt{mammal}}\}$ (middle; \ref{item:middle}), and $\mathrm{span}\{\bar{\ell}_{\texttt{animal}} - \bar{\ell}_{\texttt{plant}}, \bar{\ell}_{\texttt{bird}} - \bar{\ell}_{\texttt{mammal}}\}$ (right; \ref{item:right}). Gray points indicate all 256K tokens in the vocabulary, and the colored points are the tokens in $\mathcal{Y}(w)$. The blue and red vectors are used to span the 2D subspaces.
  • Figure 3: Vector representations exist for most binary features in the WordNet noun hierarchy. For each synset $w$ (indexed on the $x$-axis) we estimate the vector representation $\bar{\ell}_w$ using a train subset of the vocabulary $\mathcal{Y}(w)$. The plot shows the projections $(g(y)^{\top}\bar{\ell}_w)/\|\bar{\ell}_w\|_2^2$ of train (green), test (blue), and random (orange) words on estimated vector representations for each WordNet feature, using either the original (left) or shuffled (right) unembeddings. Our theory predicts that this value should be close to 1 when $y$ has the target feature, and close to 0 when it does not. The thick lines present the mean of the projections for each feature and the error bars indicate the standard deviation. As predicted, the projections of test words are near 1, and random words near 0 (left plot). Further, this structure does not hold when using the shuffled control without natural semantics (right plot).
  • Figure 4: Hierarchical semantics in WordNet are linearly represented in Gemma-2B. The left heatmap shows the pairwise shortest-distance matrix between features in the noun hierarchy graph, plotted as $(1+ \text{min\_distance})^{-1}$ (higher values indicate closeness, such as in child-parent or sibling relationships). The middle heatmap shows the cosine similarity between the vector representations $\bar{\ell}_w$. As predicted, this similarity reflects the WordNet structure. The right heatmap is a control where the embeddings are randomly shuffled (removing semantic structure). In this case, nearly everything is orthogonal, as expected in high-dimensional space (set inclusion relationships remain due to the estimation procedure). In \ref{sec:additional_results}, we include zoomed-in versions of these heatmaps.
  • Figure 5: The WordNet noun hierarchy is encoded in the orthogonal structure predicted by statement \ref{item:left} in \ref{thm:orthogonality}. We plot the cosine similarity between the child-minus-parent difference vector and the parent vector for each feature in the hierarchy (blue). As predicted, this value is close to 0. The left plot uses all data for representation estimation, and the right plot uses only 70% of the data, selected independently for each synset. We include baselines where a randomly selected feature is used as the parent (orange) and where the embeddings are shuffled (green) as controls for the possibility that the orthogonality is a simple byproduct of high-dimensional geometry, or of the set inclusion relationships used in estimation; see the main text for details. See \ref{sec:additional_results} for an analogous plot for statement \ref{item:grandparent}.
  • ...and 15 more figures
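
The two diagnostics described in the captions for Figures 3 and 5 can be sketched on synthetic data. This is an illustrative toy, not the paper's pipeline: it assumes the vectors already live in a space where the relevant inner product is the ordinary dot product, it estimates each feature vector as the mean of its training vectors (a stand-in chosen for simplicity, not the authors' estimator), and all names, dimensions, and noise levels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                     # toy dimension (assumption)
e = np.eye(d)

# Hypothetical ground-truth geometry: a shared "animal" component plus
# child-specific components in directions orthogonal to it.
animal_core = 2.0 * e[0]
mammal_center = animal_core + 1.5 * e[1]
bird_center = animal_core + 1.5 * e[2]

def sample(center, n, sigma=0.05):
    """Draw n noisy token vectors around a concept center."""
    return center + rng.normal(0.0, sigma, size=(n, d))

mammal_train, mammal_test = sample(mammal_center, 50), sample(mammal_center, 50)
bird_train, bird_test = sample(bird_center, 50), sample(bird_center, 50)
random_words = rng.normal(0.0, 0.3, size=(200, d))  # tokens with no feature

def estimate(vectors):
    """Stand-in estimator (assumption): mean of the training vectors."""
    return vectors.mean(axis=0)

ell_mammal = estimate(mammal_train)
# Balanced classes: the parent estimate pools equal numbers of each child.
ell_animal = estimate(np.vstack([mammal_train, bird_train]))

def projection(g, ell):
    """Normalized projection g^T ell / ||ell||^2, as in the Figure 3 caption."""
    return g @ ell / (ell @ ell)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

test_proj = float(projection(mammal_test, ell_mammal).mean())
rand_proj = float(projection(random_words, ell_mammal).mean())
animal_proj = float(
    projection(np.vstack([mammal_test, bird_test]), ell_animal).mean()
)
# Figure 5 diagnostic: (child - parent) versus parent.
child_parent_cos = cosine(ell_mammal - ell_animal, ell_animal)

print(f"mean projection, held-out mammal words on mammal: {test_proj:.3f}")
print(f"mean projection, random words on mammal:          {rand_proj:.3f}")
print(f"mean projection, held-out animal words on animal: {animal_proj:.3f}")
print(f"cosine(child - parent, parent):                   {child_parent_cos:.3f}")
```

In this toy setup the held-out projections should come out near 1, the random-word projections near 0, and the child-minus-parent cosine near 0, mirroring the pattern the captions report. The result depends on the balanced, symmetric construction above and says nothing about real model embeddings.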

Theorems & Definitions (16)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 4: Magnitudes of Linear Representations
  • Definition 5
  • Corollary 5: Binary Contrasts Are Vector Differences of Binary Features
  • Definition 6
  • Theorem 7: Hierarchical Orthogonality
  • Theorem 7: Magnitudes of Linear Representations
  • proof
  • ...and 6 more