Identifying Linear Relational Concepts in Large Language Models

David Chanin, Anthony Hunter, Oana-Maria Camburu

TL;DR

This work addresses how large language models represent concepts in hidden activations by introducing linear relational concepts (LRCs), derived by inverting linear relational embeddings (LREs). Given an LRE $R(s)=Ws+b$, the method computes a concept direction $v_o$ using the low-rank pseudo-inverse $W^{\dagger}$, enabling both strong classification of relational concepts and causal edits to model outputs, even for multi-token objects. Extending prior LRE work, the authors train LREs only on prompts the model answers correctly, use non-terminal object layers, and average over object tokens; performance of the low-rank inverse peaks at a rank of roughly 200 in a 4096-dimensional space. Across 47 relation types and two models (Llama2-7b and GPT-J), LRCs surpass probing baselines in both accuracy and causality, demonstrating a robust method to discover and manipulate concept directions in transformer hidden spaces. The approach opens avenues for visualizing computation and for more controllable editing, while acknowledging limitations such as per-(r,o) training requirements and potential trade-offs across model layers.
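The inversion step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `low_rank_pinv` and `linear_relational_concept`, the toy dimension, the random data, the bias subtraction $W^{\dagger}(\mathbb{E}[o]-b)$ (taken here as the plain algebraic inverse of $R(s)=Ws+b$), and the final unit-normalization are all assumptions made for illustration.

```python
import numpy as np


def low_rank_pinv(W, rank):
    """Pseudo-inverse of W built from its top `rank` singular components."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Singular values come back in descending order, so slicing keeps the largest.
    return Vt[:rank].T @ np.diag(1.0 / S[:rank]) @ U[:, :rank].T


def linear_relational_concept(W, b, object_acts, rank):
    """Hypothetical helper: map the mean object activation back to subject space.

    Assumes the inverse takes the form W^+ (E[o] - b) and that the resulting
    direction is unit-normalized; both are illustrative choices.
    """
    mean_object = object_acts.mean(axis=0)
    v = low_rank_pinv(W, rank) @ (mean_object - b)
    return v / np.linalg.norm(v)


# Toy setup: a random "LRE" in a small space (the summary mentions 4096 dims
# and rank ~200; 16 dims and rank 8 are used here only to keep the demo small).
rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d))
b = rng.normal(size=d)

# Synthetic subject activations sharing a common offset, pushed through the LRE.
subjects = rng.normal(size=(5, d)) + 3.0
object_acts = subjects @ W.T + b

v_full = linear_relational_concept(W, b, object_acts, rank=d)
v_low = linear_relational_concept(W, b, object_acts, rank=8)

# At full rank the inversion is exact up to floating point, so the concept
# direction should align with the mean subject activation.
mean_subject = subjects.mean(axis=0)
cosine_full = float(v_full @ mean_subject / np.linalg.norm(mean_subject))
print("cosine with mean subject (full rank):", cosine_full)
```

Under these assumptions, a subject activation could be scored against a concept by a dot product with the returned direction. Lowering `rank` discards the small singular directions of $W$, which is the regularizing effect the summary attributes to the low-rank inverse.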

Abstract

Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations. However, for any human-interpretable concept, how can we find its direction in the latent space? We present a technique called linear relational concepts (LRC) for finding concept directions corresponding to human-interpretable concepts by first modeling the relation between subject and object as a linear relational embedding (LRE). We find that inverting the LRE and using earlier object layers results in a powerful technique for finding concept directions that outperforms standard black-box probing classifiers. We evaluate LRCs on their performance as concept classifiers as well as their ability to causally change model output.

Paper Structure

This paper contains 16 sections, 8 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: We first model the relation between the subject $s$ and object $o$ as a linear transformation called a linear relational embedding (LRE), $R(s)$. We then invert $R(s)$ using a low-rank pseudo-inverse, resulting in $R^{-1}(o)$. Finally, we create an LRC $v$ for each object $o$ in the relation by applying $R^{-1}$ to the mean object activation $\mathbb{E}[o]$. Above, we train an LRE from the statement "San Jose is in Costa Rica", then invert that LRE and create linear relational concepts (LRCs) representing "located in England" and "located in China" from representations of objects "York" and "Shanghai", respectively.
  • Figure 2: Sample few-shot (FS) prompt for the relation "adjective superlative", subject "angry", and object "angriest" from the dataset.
  • Figure 3: Classification accuracy by relation for LRC (ours) compared to SVM on Llama2-7b. Our method outperforms SVM on most, but not all, relations.
  • Figure 4: Classification accuracy and causality by object layer for multi-token objects on Llama2-7b.
  • Figure 5: Classification accuracy and causality by object layer for single-token objects on Llama2-7b.
  • ...and 6 more figures