Mechanistic Indicators of Understanding in Large Language Models
Pierre Beckmann, Matthieu Queloz
TL;DR
This paper investigates whether large language models truly understand, arguing that mechanistic interpretability (MI) reveals structured internal mechanisms that plausibly support understanding. It introduces a tiered framework of conceptual, state-of-the-world, and principled understanding, and synthesizes MI findings to ground each tier in concrete neural mechanisms within transformers. Key results include evidence for features as directions in latent space (the linear representation hypothesis, LRH), feature superposition and its disentanglement via sparse autoencoders (SAEs), dynamic world modeling exemplified by Othello-GPT, and principled circuits such as the induction head and a Fourier-based modular addition algorithm. The work further discusses circuits in the wild and the crystallized-vs-fluid distinction, concluding that MI enables a mechanistically grounded comparative epistemology, with the caveat that AI cognition remains alien in the way it mixes heterogeneous mechanisms. Overall, the paper reframes the debate about AI understanding from a binary question into a nuanced landscape of mechanistically grounded forms of understanding and their epistemic implications.
Abstract
Large language models (LLMs) are often portrayed as merely imitating linguistic patterns without genuine understanding. We argue that recent findings in mechanistic interpretability (MI), the emerging field probing the inner workings of LLMs, render this picture increasingly untenable, but only once those findings are integrated within a theoretical account of understanding. We propose a tiered framework for thinking about understanding in LLMs and use it to synthesize the most relevant findings to date. The framework distinguishes three hierarchical varieties of understanding, each tied to a corresponding level of computational organization: conceptual understanding emerges when a model forms "features" as directions in latent space, learning connections between diverse manifestations of a single entity or property; state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world; principled understanding emerges when a model ceases to rely on memorized facts and discovers a compact "circuit" connecting these facts. Across these tiers, MI uncovers internal organizations that can underwrite understanding-like unification. However, these also diverge from human cognition in their parallel exploitation of heterogeneous mechanisms. Fusing philosophical theory with mechanistic evidence thus allows us to transcend binary debates over whether AI understands, paving the way for a comparative, mechanistically grounded epistemology that explores how AI understanding aligns with, and diverges from, our own.
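
To make the phrase "features as directions in latent space" concrete, the following is a minimal sketch on purely synthetic data, not an analysis of any real model. It plants a hypothetical direction in random "activations", recovers it with a difference-of-means estimate (one common and simple choice, assumed here for illustration rather than taken from the paper), and reads the feature off with a dot product. All names (`d_model`, `true_direction`, `acts_present`, `acts_absent`) and the signal strength of 3.0 are assumptions of this toy setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16   # hypothetical latent dimensionality
n = 200        # synthetic samples per class

# Planted unit-norm "feature direction" (an assumption of this toy setup).
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)

# Synthetic activations: Gaussian noise, shifted along the planted
# direction whenever the property is present.
acts_absent = rng.normal(size=(n, d_model))
acts_present = rng.normal(size=(n, d_model)) + 3.0 * true_direction

# Difference-of-means estimate of the feature direction.
estimated = acts_present.mean(axis=0) - acts_absent.mean(axis=0)
estimated /= np.linalg.norm(estimated)

# Reading the feature off an activation is a dot product with the direction.
score_present = acts_present @ estimated
score_absent = acts_absent @ estimated

cosine = float(estimated @ true_direction)
print(f"cosine(estimated, planted) = {cosine:.3f}")
print(f"mean score, property present: {score_present.mean():.2f}")
print(f"mean score, property absent:  {score_absent.mean():.2f}")
```

In this toy setting a single linear readout separates the two groups because the property was constructed to be linear. Whether and where real models encode properties this way is the empirical question the MI findings address.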
