A Philosophical Introduction to Language Models - Part II: The Way Forward

Raphaël Millière; Cameron Buckner

TL;DR

This two-part work surveys philosophical questions raised by large language models (LLMs), focusing on moving beyond behavioral benchmarks to uncover internal mechanisms via interventionist methods. It presents case studies on induction heads, modular addition, and world models to show that LLMs instantiate structured, causally relevant representations, challenging the view that they rely on mere memorization. The paper discusses newer trends (multimodal and agent-based architectures) and their philosophical implications for grounding, consciousness, and scientific legitimacy, while advocating for openness and reproducibility. It argues for a cautious middle ground: LLMs are useful partial models of certain cognitive processes but do not yet constitute full cognitive or conscious agents, which underscores the importance of rigorous methodology and interdisciplinary collaboration. Overall, the work highlights both the promise and the limits of LLMs as tools for understanding intelligence, prompting ongoing research into their internal computations and their role in cognitive science.

Abstract

In this paper, the second of two companion pieces, we explore novel philosophical questions raised by recent progress in large language models (LLMs) that go beyond the classical debates covered in the first part. We focus particularly on issues related to interpretability, examining evidence from causal intervention methods about the nature of LLMs' internal representations and computations. We also discuss the implications of multimodal and modular extensions of LLMs, recent debates about whether such systems may meet minimal criteria for consciousness, and concerns about secrecy and reproducibility in LLM research. Finally, we discuss whether LLM-like systems may be relevant to modeling aspects of human cognition, if their architectural characteristics and learning scenario are adequately constrained.

Paper Structure (22 sections, 7 figures)

This paper contains 22 sections, 7 figures.

Introduction
Mechanistic understanding and intervention methods
The trouble with benchmarks
Mechanistic explanation
Opening up the black box
Interventions on neural networks
Mechanistic interpretability
Case study 1: Induction heads
Case study 2: Modular addition
Case study 3: World models
Interpretability and causal abstraction
Biological plausibility of decoded computations
Newer philosophical questions
LLMs and modular architectures
Multimodality
...and 7 more sections

Figures (7)

Figure 1: Iterative nullspace projection. Given the representation $\vec{h}_{i}$ of a masked word in layer $i$ of a Transformer model, a probe is trained to predict relative clause boundaries. The probe's nullspace, which encodes information not relevant to relative clause boundaries, is identified. Two counterfactual representations, $\vec{h}^{-}_{i}$ and $\vec{h}^{+}_{i}$, are derived by projecting $\vec{h}_{i}$ onto the nullspace and then performing negative and positive interventions, respectively, along the probe's decision boundary. $\vec{h}^{-}_{i}$ encodes that the word is outside a relative clause, while $\vec{h}^{+}_{i}$ encodes that it is inside a relative clause, with other information preserved. The model's predictions using these counterfactual representations are compared to its original prediction to assess the causal effect of the relative clause boundary information on the model's behavior in number agreement.
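The projection-then-intervention step in this caption can be sketched in a few lines. This is a simplified illustration, not the procedure used in the paper: it assumes a single linear probe whose weight vector `w` is already given (no probe training, no iteration over successive probes), and the function name `counterfactual_pair` and the step size `alpha` are hypothetical.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def counterfactual_pair(h, w, alpha=1.0):
    """Return (h_minus, h_plus) for representation h and probe direction w.

    Assumption: the probe is linear, so its nullspace is the set of
    vectors orthogonal to w.
    """
    norm = math.sqrt(dot(w, w))
    u = [x / norm for x in w]                  # unit probe direction
    coeff = dot(h, u)
    # Project h onto the probe's nullspace: remove the component along u.
    h_null = [x - coeff * y for x, y in zip(h, u)]
    # Negative / positive interventions along the probe direction.
    h_minus = [x - alpha * y for x, y in zip(h_null, u)]
    h_plus = [x + alpha * y for x, y in zip(h_null, u)]
    return h_minus, h_plus

# Toy example: the probe reads only the third coordinate.
h = [2.0, 1.0, -1.0]
w = [0.0, 0.0, 2.0]
h_minus, h_plus = counterfactual_pair(h, w)
```

In the toy example the first two coordinates of `h` are untouched, while the probe's score `dot(h_plus, w)` is positive and `dot(h_minus, w)` is negative. The iterative variant would repeat the projection with newly trained probes until the property can no longer be decoded.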
Figure 2: The residual stream view of the Transformer. Each input token is first embedded into a dense vector representation and combined with a positional encoding that injects information about its position in the input sequence. This forms the initial state of the "residual stream" (depicted by the red arrow), which flows through the entire network. Each Transformer block, consisting of multi-head self-attention and a multi-layer perceptron (MLP), reads from the residual stream, transforms the representation, and writes the result back into the stream via residual connections. This process is repeated across multiple Transformer blocks. Finally, the output of the residual stream is 'unembedded' to map the transformed representation back to the original token space. In this view, the Transformer's components are seen as operators that successively refine the residual representation.
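The read-transform-write pattern described in this caption can be made concrete with a toy forward pass. Everything here is an illustrative assumption rather than a real Transformer: `toy_attention` is a fixed causal average instead of learned multi-head attention, `toy_mlp` is a scaled ReLU, and the unembedding reuses the embedding table.

```python
def add(u, v):
    return [x + y for x, y in zip(u, v)]

def toy_attention(stream):
    """Stand-in for self-attention: causal average over positions <= t."""
    out = []
    for t in range(len(stream)):
        ctx = stream[: t + 1]
        out.append([0.5 * sum(col) / len(ctx) for col in zip(*ctx)])
    return out

def toy_mlp(stream):
    """Stand-in for the MLP: position-wise scaled ReLU."""
    return [[0.1 * max(x, 0.0) for x in vec] for vec in stream]

def forward(token_ids, embed, pos_embed, n_blocks=2):
    # Initial residual stream: token embedding plus positional encoding.
    stream = [add(embed[t], pos_embed[i]) for i, t in enumerate(token_ids)]
    for _ in range(n_blocks):
        # Each component reads the stream and adds its output back to it.
        stream = [add(s, w) for s, w in zip(stream, toy_attention(stream))]
        stream = [add(s, w) for s, w in zip(stream, toy_mlp(stream))]
    # Unembedding: score every vocabulary item against the final stream.
    logits = [[sum(x * e for x, e in zip(vec, emb)) for emb in embed]
              for vec in stream]
    return stream, logits

embed = [[1.0, 0.0], [0.0, 1.0]]                  # 2 tokens, width 2
pos_embed = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]  # 3 positions
stream, logits = forward([0, 1, 0], embed, pos_embed)
```

Because every component only adds to the stream, the final state is the initial embedding plus the sum of all writes, which is what makes it natural to study individual components as separate operators.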
Figure 3: Activation patching. A. In the original forward pass, the model takes as input the prompt "The capital of France is" and outputs the correct answer "Paris". The model activations from this forward pass are cached. B. In the alternative forward pass, the prompt is changed to "The capital of Germany is". The model now outputs "Berlin" as the answer. C. Activation patching is applied in a third forward pass. The model is given the alternative prompt "The capital of Germany is" once again, but a specific component of the model has its activations replaced (patched) with those from the original forward pass on the France prompt. This causes the model to output "Paris" instead of "Berlin", despite being given the Germany prompt. The restoration of the original output through patching the activations of a particular model component provides evidence that this component encodes information that is causally implicated in the target behavior.
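The three forward passes in this caption can be mimicked with a deliberately tiny stand-in for a model. This is a sketch under strong assumptions: the "model" is a lookup table with two hypothetical components (`subject` and `relation`), not a neural network, and `run`, `COMPONENTS`, and `CAPITALS` are names invented for illustration.

```python
# Toy lookup table standing in for knowledge stored in weights (assumption).
CAPITALS = {"France": "Paris", "Germany": "Berlin"}

# Hypothetical components: each maps the prompt to one activation.
COMPONENTS = {
    "subject": lambda prompt: prompt.split()[3],   # e.g. "France"
    "relation": lambda prompt: prompt.split()[1],  # e.g. "capital"
}

def run(prompt, patch=None):
    """Forward pass; `patch` overrides named activations with cached ones."""
    patch = patch or {}
    acts = {}
    for name, component in COMPONENTS.items():
        acts[name] = patch.get(name, component(prompt))
    if acts["relation"] == "capital":
        output = CAPITALS[acts["subject"]]
    else:
        output = "?"
    return output, acts

# A. Original pass: cache the activations.
clean_out, clean_cache = run("The capital of France is")
# B. Alternative pass.
alt_out, _ = run("The capital of Germany is")
# C. Patched passes: restore one component at a time from the cache.
patched = {
    name: run("The capital of Germany is", patch={name: clean_cache[name]})[0]
    for name in COMPONENTS
}
restoring = [name for name, out in patched.items() if out == clean_out]
```

Patching `subject` restores "Paris" while patching `relation` leaves "Berlin" unchanged, so only the first component is implicated in the difference between the two prompts. With a real network, the same loop would run over layers, heads, or positions, typically using forward hooks to swap activations.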
Figure 4: A schematic illustration of the induction head circuit in a two-layer Transformer model. At the embedding stage, each token from the input sequence is encoded as a vector, together with information about its position in the sequence. The first layer contains an attention head, known as the previous token head, that acquired a specialized function during training. When processing token [B] at position $2$, the previous token head does the following: (1.1) it attends to the previous token at position $1$; (1.2) it writes the identity of this preceding token to a dedicated subspace of the residual stream at the current position (position $2$), effectively storing the information "the token before me is [A]". Layer 2 contains another specialized attention head known as the induction head. When processing the second instance of [A] at position $n$, the induction head does the following: (2.1) it queries for information in the 'previous token' subspace matching the current token's identity; (2.2) having located this previous token information in the residual stream at position $2$, it retrieves the identity of the token at that position ([B]), then writes this identity to a dedicated subspace of the residual stream at the current position (position $n$), effectively storing the information "predict that the next token will be [B]". (3) The unembedding layer maps information in the 'next token' subspace at position $n$ to an increased logit for [B] at position $n+1$, which translates to an increased log likelihood of [B] being predicted as the next token.
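Stripped of its neural implementation, the circuit in this caption computes a simple copy rule: find an earlier token that was preceded by the current token and predict it. The sketch below is an algorithmic caricature under stated assumptions; it uses exact token matching in place of soft attention, and the function name `induction_prediction` is hypothetical.

```python
def induction_prediction(tokens):
    """Predict the next token with the [A][B] ... [A] -> [B] rule.

    Returns None when the current token has not occurred earlier.
    """
    # Layer 1 (previous token head): each position stores the identity
    # of the token that precedes it.
    prev_token = [None] + tokens[:-1]
    # Layer 2 (induction head): from the current (last) position, look
    # for a position whose stored previous token matches the current
    # token, starting with the most recent match.
    current = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):
        if prev_token[pos] == current:
            # Copy the token found at that position as the prediction.
            return tokens[pos]
    return None

prediction = induction_prediction(["A", "B", "C", "A"])
```

Here `prediction` is "B": position 1 stores "the token before me is A", which matches the current token, so the token at that position is copied forward. A trained model realizes this with attention weights and subspaces of the residual stream rather than explicit lookups.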
Figure 5: A learned algorithm for modular addition (figure adapted from Nanda et al., 2022). A. Embedding projection. Given two input numbers $a$ and $b$ in the modular addition $a + b \equiv c \pmod{P}$, the model uses its embedding matrix to project each number onto a corresponding rotation around the unit circle. The embedding matrix essentially memorizes a mapping between each possible input number and a specific rotation amount, converting the numbers into geometric representations. B. Rotation composition. The model composes the two rotations generated for $a$ and $b$. This step effectively adds the two rotation amounts together, resulting in a new, single rotation that represents the sum $a+b$ in modular arithmetic. In modular arithmetic, numbers 'wrap around' on reaching the modulus $P$, so if $a+b$ is at least $P$, the resulting rotation will correspond to $(a+b) \bmod P$, which is the remainder when $a+b$ is divided by $P$. C. Output decoding. To produce the output logits (raw scores used for next-token prediction), the model considers each possible result $c$ (ranging from $0$ to $P-1$) and applies a reverse rotation (a rotation by $-c$). This step essentially checks, for each $c$, whether undoing the rotation by $c$ results in a rotation that matches the one representing $(a+b) \bmod P$. The output $c$ that produces the rotation most closely matching the $(a+b) \bmod P$ rotation is assigned the highest logit. This works because the trained model ensures that the correct $c$ satisfying $a + b \equiv c \pmod{P}$ will undo the rotation by exactly the right amount to point back to the $(a+b) \bmod P$ rotation. The trigonometric functions cosine and sine are used to implement these rotations and achieve the desired result mathematically using angle addition identities, but conceptually, the algorithm is based on representing numbers as rotations and composing these rotations together.
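The three steps in this caption can be written out directly with the angle addition identities. This is a hand-coded sketch of the rotation idea, not the trained network's weights: the frequencies in `freqs` and the function names are arbitrary illustrative choices.

```python
import math

def modular_addition_logits(a, b, P, freqs=(1,)):
    """Score every candidate c in 0..P-1 using rotations on the unit circle."""
    logits = []
    for c in range(P):
        score = 0.0
        for k in freqs:
            w = 2 * math.pi * k / P
            # A. Embed a and b as rotations (cosine and sine of their angles).
            cos_a, sin_a = math.cos(w * a), math.sin(w * a)
            cos_b, sin_b = math.cos(w * b), math.sin(w * b)
            # B. Compose the two rotations with the angle addition identities.
            cos_ab = cos_a * cos_b - sin_a * sin_b
            sin_ab = sin_a * cos_b + cos_a * sin_b
            # C. Rotate back by c; this equals cos(w * (a + b - c)),
            #    which peaks when c is congruent to a + b modulo P.
            score += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        logits.append(score)
    return logits

def predict(a, b, P, freqs=(1,)):
    logits = modular_addition_logits(a, b, P, freqs)
    return max(range(P), key=lambda c: logits[c])

result = predict(5, 9, 11)
```

For `predict(5, 9, 11)` the highest score falls on $c = 3$, since $5 + 9 = 14 \equiv 3 \pmod{11}$. Summing over several frequencies, as in `freqs=(1, 2, 5)`, sharpens the gap between the correct answer and its neighbours without changing which $c$ wins when $P$ is prime.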
...and 2 more figures
