Equivalent Linear Mappings of Large Language Models

James R. Golden

TL;DR

The paper tackles the interpretability challenge of large language models by recasting inference as an equivalent linear system through a detached Jacobian that freezes nonlinear terms at a fixed input. This yields a pointwise-linear mapping from input embeddings to output embeddings, enabling exact reconstruction with relative error below $10^{-13}$ for several model families without additional training. Empirically, the detached Jacobian exhibits a very low-rank structure, with singular vectors that map to interpretable semantic concepts and facilitate layer- and module-level analysis, including a steering use-case. The approach opens a path toward scalable, input-specific linear interpretability of Transformer decoders and suggests practical directions for semantic steering and deeper structural understanding of next-token prediction.

Abstract

Despite significant progress in transformer interpretability, an understanding of the computational mechanisms of large language models (LLMs) remains a fundamental challenge. Many approaches interpret a network's hidden representations but remain agnostic about how those representations are generated. We address this by mapping LLM inference for a given input sequence to an equivalent and interpretable linear system which reconstructs the predicted output embedding with relative error below $10^{-13}$ at double floating-point precision, requiring no additional model training. We exploit a property of transformers wherein every operation (gated activations, attention, and normalization) can be expressed as $A(x) \cdot x$, where $A(x)$ represents an input-dependent linear transform and $x$ preserves the linear pathway. To expose this linear structure, we strategically detach components of the gradient computation with respect to an input sequence, freezing the $A(x)$ terms at their values computed during inference, such that the Jacobian yields an equivalent linear mapping. This detached Jacobian of the model reconstructs the output with one linear operator per input token, which is shown for Qwen 3, Gemma 3 and Llama 3, up to Qwen 3 14B. These linear representations demonstrate that LLMs operate in extremely low-dimensional subspaces where the singular vectors can be decoded to interpretable semantic concepts. The computation for each intermediate output also has a linear equivalent, and we examine how the linear representations of individual layers and their attention and multilayer perceptron modules build predictions, and use these as steering operators to insert semantic concepts into unrelated text. Despite their global nonlinearity, LLMs can be interpreted through equivalent linear representations that reveal low-dimensional semantic structures in the next-token prediction process.
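The $A(x) \cdot x$ factorization described above can be illustrated on a toy block. The following is a minimal sketch, not the paper's implementation: it assumes a single bias-free block (RMSNorm followed by a SwiGLU-style gated MLP) with random weights, and it builds the frozen operator by hand in NumPy rather than by detaching gradients in an autodiff framework. All names and sizes (`d_model`, `d_hidden`, `rms_scale`, `detached_jacobian`) are hypothetical. Treating the gate activation as the frozen factor and the up-projection as the linear pathway is also an assumption made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16  # hypothetical toy sizes, not taken from the paper

# Random weights for one bias-free block: RMSNorm followed by a gated MLP.
gain = rng.standard_normal(d_model)
W_gate = rng.standard_normal((d_hidden, d_model))
W_up = rng.standard_normal((d_hidden, d_model))
W_down = rng.standard_normal((d_model, d_hidden))
EPS = 1e-6


def silu(z):
    """SiLU / Swish activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


def rms_scale(x):
    """Input-dependent diagonal of RMSNorm, so that norm(x) = rms_scale(x) * x."""
    return gain / np.sqrt(np.mean(x**2) + EPS)


def forward(x):
    """Ordinary nonlinear forward pass of the toy block."""
    n = rms_scale(x) * x
    return W_down @ (silu(W_gate @ n) * (W_up @ n))


def detached_jacobian(x):
    """Matrix A(x) with the nonlinear terms frozen at their values for this x.

    Assumption: the gate activation is the frozen factor and the
    up-projection carries the linear pathway.
    """
    scale = rms_scale(x)               # frozen normalization factor
    gate = silu(W_gate @ (scale * x))  # frozen gate activation
    # Equivalent to W_down @ diag(gate) @ W_up @ diag(scale), via broadcasting.
    return (W_down * gate) @ (W_up * scale)


x = rng.standard_normal(d_model)
y = forward(x)
A = detached_jacobian(x)
rel_err = np.linalg.norm(A @ x - y) / np.linalg.norm(y)
print(f"relative reconstruction error: {rel_err:.2e}")
```

The matrix `A` reproduces the output only at the input it was computed for; a different `x` yields a different matrix, which is the sense in which the mapping is input-specific. In this sketch the agreement between `A @ x` and `forward(x)` is limited only by floating-point rounding, since both evaluate the same product in a different order.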
