Investigating Representation Universality: Case Study on Genealogical Representations

David D. Baek, Yuxiao Li, Max Tegmark

TL;DR

The paper investigates whether LLMs encode discrete graph-structured knowledge using universal geometric representations. It uses two complementary approaches: a cone-probe analysis of in-context genealogy tasks to identify tree-like subspaces and activation patching to test causality, and cross-model stitching across diverse architectures to assess representational alignment. The findings show emergent tree-like cone embeddings in residual activations and stronger alignment in early-to-mid layers across models, supporting the universality hypothesis while acknowledging limitations due to small graphs and lack of ground-truth representations. These results advance interpretability by suggesting generalizable geometric structures in LLMs and point to future work on larger graphs and uncertainty estimation. Overall, understanding these representations could inform the design of more interpretable, robust, and controllable AI systems.

Abstract

Motivated by interpretability and reliability, we investigate whether large language models (LLMs) deploy universal geometric structures to encode discrete, graph-structured knowledge. To this end, we present two complementary lines of experimental evidence that might support the universality of graph representations. First, on an in-context genealogy Q&A task, we train a cone probe to isolate a tree-like subspace in residual stream activations and use activation patching to verify its causal effect in answering related questions. We validate our findings across five different models. Second, we conduct model stitching experiments across models of diverse architectures and parameter counts (OPT, Pythia, Mistral, and LLaMA, 410 million to 8 billion parameters), quantifying representational alignment via relative degradation in the next-token prediction loss. Generally, we conclude that the lack of ground-truth representations of graphs makes it challenging to study how LLMs represent them. Ultimately, improving our understanding of LLM representations could facilitate the development of more interpretable, robust, and controllable AI systems.
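
The abstract names "relative degradation in the next-token prediction loss" as the stitching metric but does not spell out a formula here. The sketch below shows one plausible reading: the fractional increase of the stitched model's loss over a reference loss. The normalization, the function names, and the toy probabilities are all assumptions made for illustration and are not taken from the paper.

```python
import math


def mean_next_token_loss(correct_token_probs):
    """Average negative log-likelihood of the correct next tokens.

    `correct_token_probs` holds the probability a model assigned to each
    correct next token (hypothetical toy input, not real model output).
    """
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)


def relative_degradation(stitched_loss, reference_loss):
    """Fractional loss increase of a stitched model over a reference loss.

    Assumed definition: (stitched - reference) / reference.
    """
    return (stitched_loss - reference_loss) / reference_loss


# Toy numbers only: probabilities assigned to the correct next token.
reference_probs = [0.5, 0.25, 0.5, 0.25]
stitched_probs = [0.25, 0.25, 0.25, 0.25]

reference_loss = mean_next_token_loss(reference_probs)
stitched_loss = mean_next_token_loss(stitched_probs)
degradation = relative_degradation(stitched_loss, reference_loss)
```

Under this assumed definition, a value near zero would mean the stitched model predicts almost as well as the reference, while larger values indicate poorer alignment between the stitched halves.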

Paper Structure

This paper contains 11 sections, 9 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Visualization of the top two principal components of an MLP trained to learn the descendant-of relationship across nine different random seeds -- for models trained on either (left) a fully balanced binary tree or (right) a randomly generated general tree consisting of 15 nodes. For clarity, we add arrows connecting direct parent–child pairs. Each plot is rotated so that the root node appears at the top of the panel. Across different seeds and tree structures, the learned representations consistently exhibit a geometric pattern that resembles a tree in discrete mathematics -- a structure we define as cone embedding in the main text. Note that the models do not separate two sibling leaf nodes under the same parent. This is because all embeddings are initialized to zero, and the model receives no gradient signal to separate two sibling leaf nodes -- they are equivalent nodes when it comes to determining the descendant-of relationship.
  • Figure 2: Top: Visualization of in-context genealogy-tree representations from LLaMA-3.1-8B-Instruct across five different random name assignments on a full binary tree of 15 nodes. We show the projection onto the first two principal components and the projection onto the cone-probe subspace. Nodes and edges are colored by their depth in the tree. We added arrows connecting direct parent-to-child links for visualization. Bottom: Average F1 score on question-answering tasks about descendant-of relationships, averaged over five different name assignments on a tree. These results suggest that the model may struggle with compositional generalization if the relevant facts are not provided in order.
  • Figure 3: Top: Illustration of our intervention methodology. Bottom: Intervention results across five models. The histogram shows the causal effect of patching two subspaces of the residual stream activations at one-third model depth: (a) the subspace spanned by the top two principal components and (b) the cone subspace.
  • Figure 4: Intervention results for Llama-3.1-8B-Instruct across different layers. The plot shows the causal effect of patching two subspaces of the residual stream activations: (a) the subspace spanned by the top two principal components and (b) the cone subspace. Standard errors are indicated as a shaded region. Full represents patching the full activation at a specific layer.
  • Figure 5: Left: In-context learning accuracy for models stitched between OPT-2.7B and OPT-6.7B. Base indicates OPT-6.7B, and $x$% indicates that the embedding layer and the first $x$% of OPT-6.7B are replaced by those of OPT-2.7B. Right: Test loss as a function of stitched position between two different models. The two models are cut at the same relative depth within each model. The black dashed line in the right panel indicates the average test loss of the original models.
  • ...and 2 more figures
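
The captions for Figures 3 and 4 describe patching only a subspace of the residual stream activations rather than the full activation. A minimal sketch of that general idea is given below: the component of a target activation inside a chosen subspace is swapped for the corresponding component of a source activation, and everything orthogonal to the subspace is left untouched. This is a generic illustration and not the authors' implementation. The function names, the 4-dimensional toy vectors, and the hand-picked spanning directions are hypothetical; in practice the directions would come from, for example, the top two principal components or a trained cone probe.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: turn spanning directions into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project(h, basis):
    """Orthogonal projection of h onto the span of an orthonormal basis."""
    out = [0.0] * len(h)
    for b in basis:
        c = dot(h, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


def patch_subspace(h_target, h_source, directions):
    """Replace the in-subspace part of h_target with that of h_source."""
    basis = orthonormalize(directions)
    p_target = project(h_target, basis)
    p_source = project(h_source, basis)
    return [t - pt + ps for t, pt, ps in zip(h_target, p_target, p_source)]


# Toy 4-dimensional "activations" (hypothetical values).
h_target = [1.0, 2.0, 3.0, 4.0]
h_source = [5.0, 6.0, 7.0, 8.0]
# Two directions spanning the first two coordinates.
directions = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]

patched = patch_subspace(h_target, h_source, directions)
```

In this toy case the patched vector takes its first two coordinates from the source and keeps the last two from the target. Patching the full activation, as in the "Full" condition of Figure 4, corresponds to a subspace that spans the whole space.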