Mechanistic Indicators of Understanding in Large Language Models

Pierre Beckmann, Matthieu Queloz

TL;DR

This paper investigates whether large language models truly understand, arguing that mechanistic interpretability (MI) reveals structured internal mechanisms that plausibly support understanding. It introduces a tiered framework of conceptual, state-of-the-world, and principled understanding, and synthesizes MI findings to ground each tier in concrete neural mechanisms within transformers. Key results include evidence for latent-space feature directions (LRH), feature superposition and disentanglement via SAEs, dynamic world modeling exemplified by Othello-GPT, and principled circuits such as the induction head and a Fourier-based modular addition algorithm. The work further discusses circuits in the wild and the crystallized-vs-fluid distinction, concluding that MI enables a mechanistically grounded comparative epistemology, albeit with the caveat that AI cognition remains alien in its mix of mechanisms. Overall, the paper reframes the debate about AI understanding from a binary question to a nuanced landscape of mechanistically grounded forms of understanding and their epistemic implications.

Abstract

Large language models (LLMs) are often portrayed as merely imitating linguistic patterns without genuine understanding. We argue that recent findings in mechanistic interpretability (MI), the emerging field probing the inner workings of LLMs, render this picture increasingly untenable--but only once those findings are integrated within a theoretical account of understanding. We propose a tiered framework for thinking about understanding in LLMs and use it to synthesize the most relevant findings to date. The framework distinguishes three hierarchical varieties of understanding, each tied to a corresponding level of computational organization: conceptual understanding emerges when a model forms "features" as directions in latent space, learning connections between diverse manifestations of a single entity or property; state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world; principled understanding emerges when a model ceases to rely on memorized facts and discovers a compact "circuit" connecting these facts. Across these tiers, MI uncovers internal organizations that can underwrite understanding-like unification. However, these also diverge from human cognition in their parallel exploitation of heterogeneous mechanisms. Fusing philosophical theory with mechanistic evidence thus allows us to transcend binary debates over whether AI understands, paving the way for a comparative, mechanistically grounded epistemology that explores how AI understanding aligns with--and diverges from--our own.

Paper Structure

This paper contains 16 sections, 11 figures.

Figures (11)

  • Figure 1: Model activations as vectors in latent spaces. (a) The activations at one layer of a model can be conceptualized as a vector picking out a point in a latent space. The latent space has as many dimensions as there are nodes in the layer. (b) In these latent spaces, and as a result of the training stage, inputs are effectively differentiated along certain directions; each direction corresponds to a learned feature of the input space. When an input produces an activation vector picking out a point within that space, the position of this point along such a direction reflects how salient that feature is in the input.
  • Figure 2: An illustration of interference. Feature A is encoded by a direction that is non-orthogonal to the direction encoding feature B. Consequently, the activation of feature A results in a non-zero projection onto feature B, causing a spurious activation of feature B.
  • Figure 3: Steps to discover features in LLMs using sparse autoencoders (SAEs). The idea is to train a sparse autoencoder to project activations of an LLM to a sparse combination of features picked from a very large set of possible features (steps 1 and 2). Often, though by no means always, these features end up corresponding to human-interpretable concepts (which one can determine with step 3).
  • Figure 4: General overview of a decoder-only, transformer-based LLM. (a) Each transformer block consists of an attention layer, where attention heads operate in parallel, and an MLP layer. Both the attention heads and the MLP layer add to the embedding that will ultimately produce token predictions. The main information stream to which the transformer blocks contribute is called the residual stream; there is one residual stream for each token. (b) At each token, the model computes logits, which represent likelihoods of the upcoming token, by passing through an embedding stage, $k$ transformer blocks, and an unembedding stage.
  • Figure 5: The operation of a single attention head. This allows the model to weigh the importance of different tokens in a sequence when processing a specific token. The process begins when the head generates a Query vector ($q_2$) for the current token (at position 2), representing the information it needs. Simultaneously, it generates Key vectors ($k_0, k_1$) and Value vectors ($v_0, v_1$) for previous tokens in the sequence. The key $k$ acts as a label for the information offered by a token, while the value $v$ contains the actual content. To determine relevance, the query $q_2$ is compared with each key ($k_0$ and $k_1$), producing raw similarity scores. These scores are then normalized by a softmax function to create the final attention scores ($a_0, a_1$), which are weights that sum to one. Finally, a result vector ($r$) is computed by taking a weighted sum of the value vectors ($v_0$ and $v_1$) using their corresponding attention scores. This result vector, which is a blend of relevant information from other tokens, is then added back into the residual stream.
  • ...and 6 more figures
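
The interference described in the Figure 2 caption can be illustrated with a small numerical sketch. It assumes that a feature is read out by projecting the activation vector onto a unit-length feature direction, in line with the Figure 1 caption. The two-dimensional directions and the activation strength are hypothetical values chosen for illustration, not taken from the paper.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Rescale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

# Hypothetical feature directions in a 2-D latent space (illustrative only).
feature_a = unit([1.0, 0.0])
feature_b = unit([1.0, 1.0])  # 45 degrees from feature A: non-orthogonal
feature_c = unit([0.0, 1.0])  # orthogonal to feature A

# An activation in which only feature A is present, with strength 2.
activation = [2.0 * x for x in feature_a]

# Reading out a feature = projecting the activation onto its direction.
readout_a = dot(activation, feature_a)  # the intended signal
readout_b = dot(activation, feature_b)  # non-zero although B is absent
readout_c = dot(activation, feature_c)  # zero: no interference

print(f"A: {readout_a:.3f}  B: {readout_b:.3f}  C: {readout_c:.3f}")
```

In this toy setup the readout for feature B is non-zero purely because its direction overlaps with that of feature A, which is the spurious activation the caption refers to. The orthogonal feature C is unaffected.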
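
The single-head computation described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the query, key, and value vectors are hypothetical toy numbers given directly rather than produced by learned projections, the scores are divided by the square root of the vector dimension (the common scaled dot-product convention, which the caption does not mention), and the result is added to the residual stream without an output projection.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Normalize raw scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(query, keys, values):
    """Return (attention weights, result vector) for one query."""
    scale = math.sqrt(len(query))  # assumption: scaled dot-product
    raw_scores = [dot(query, k) / scale for k in keys]
    weights = softmax(raw_scores)
    dim = len(values[0])
    # Weighted sum of the value vectors, one coordinate at a time.
    result = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(dim)]
    return weights, result

# Hypothetical toy vectors for a current token at position 2.
q2 = [1.0, 0.0]                      # query of the current token
keys = [[1.0, 0.0], [0.0, 1.0]]      # k_0, k_1 of the previous tokens
values = [[10.0, 0.0], [0.0, 10.0]]  # v_0, v_1 of the previous tokens

weights, r = attention_head(q2, keys, values)

# The result vector is added back into the current token's residual stream.
residual = [0.5, 0.5]
residual = [x + y for x, y in zip(residual, r)]

print("attention weights:", [round(w, 3) for w in weights])
print("result vector:", [round(x, 3) for x in r])
```

Because the query matches `k_0` more closely than `k_1`, the first token receives the larger weight and the result vector is dominated by `v_0`.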