Equivalence of Context and Parameter Updates in Modern Transformer Blocks

Adrian Goldwaser, Michael Munn, Javier Gonzalvo, Benoit Dherin

TL;DR

The paper tackles how in-context learning emerges in modern transformers by proving that the context can be absorbed as implicit patches to MLP weights and normalization parameters. It provides a constructive proof for a Gemma-style block, then extends the result to multi-layer architectures and offers a practical, layer-by-layer algorithm to compute the required updates. A general framework based on input controllability and output controllability unifies these results and extends them to a broad class of architectures, including gating, RMSNorm, and MoE variants, with empirical validation on Gemma 3 showing near-perfect logit matching in many settings. The work delivers a principled lens for understanding prompt-induced reconfiguration and informs architectural design for robust in-context learning, while noting the token-dependent nature of updates that precludes a single global inference-time patch.

Abstract

Recent research has established that the impact of context in a vanilla transformer can be represented implicitly by forming a token-dependent, rank-1 patch to its MLP weights. This work extends that foundational theory to the diverse architectures of modern Large Language Models. We first demonstrate a precise, analytical solution for a Gemma-style transformer block, proving that the entire effect of a context can be perfectly mapped to rank-1 patches on its MLP weight matrices and a patch to the RMSNorm scale. We then generalize this result, providing a constructive proof and algorithm for multi-layer models. To unify these findings, we introduce a general framework centered on two core properties: input controllability and output controllability. We prove that a perfect implicit weight patch is possible for any MLP block where the inner function is input-controllable and the outer function is output-controllable. This provides a simpler and more powerful lens for understanding how transformer models transmute prompts into effective weights. This setup generalizes to a wide range of modern LLM architectures including gating, pre-/post-norm, mixture of experts and sequential/parallel transformer blocks.
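The rank-1 patch mentioned in the abstract can be illustrated for a single weight matrix. The following is a minimal NumPy sketch, not the paper's code: the shapes, the random vectors standing in for the attention outputs $A(C \setminus Y, \mathbf{x})$ and $A(C, \mathbf{x})$, and the variable names are all assumptions made for illustration. It checks that a rank-1 update to $W$ makes the reduced-context input produce the same pre-activation as the full-context input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 5  # hypothetical dimensions

W = rng.normal(size=(d_out, d_in))  # first MLP weight matrix (stand-in)
v = rng.normal(size=d_in)           # stand-in for A(C \ Y, x), reduced context
v_C = rng.normal(size=d_in)         # stand-in for A(C, x), full context

# Rank-1 patch chosen so that delta_W @ v == W @ (v_C - v).
# It depends on v, so it is specific to this token and context.
delta_W = np.outer(W @ (v_C - v), v) / (v @ v)

patched = (W + delta_W) @ v   # patched weights, reduced-context input
original = W @ v_C            # original weights, full-context input

print(np.allclose(patched, original))
print(np.linalg.matrix_rank(delta_W))
```

Because `delta_W` is built from `v`, a different query token or context yields a different patch, which is the token-dependence the TL;DR refers to.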

Paper Structure

This paper contains 22 sections, 10 theorems, 22 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Let $\mathbf{v}_C = A(C, \mathbf{x})$ and $\mathbf{v} = A(C \setminus Y, \mathbf{x})$ be the intermediate outputs from the attention sub-layer with the full context and a reduced context, respectively. Let their normalized versions be $\mathbf{z}_C = N_{\text{RMS}}(\mathbf{v}_C)$ and $\mathbf{z} = N_{\text{RMS}}(\mathbf{v})$. [...] where the division $\oslash$ in eq:delta_m is performed element-wise.
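The theorem statement is truncated here, so the sketch below does not reproduce its formulas. It is an illustrative reconstruction under stated assumptions: a simplified gated block of the form $\mathbf{v} + \mathbf{m} \otimes N_{\text{RMS}}(W_{\text{down}}(\text{gelu}(W_{\text{gate}}\mathbf{z}) \otimes W_{\text{up}}\mathbf{z}))$, an unscaled RMS normalization, and hypothetical names (`block`, `rank1_patch`, `delta_m`). It shows one way rank-1 patches on the inner weights plus an element-wise patch to the scale $\mathbf{m}$ can reproduce the full-context output from the reduced-context input.

```python
import numpy as np

def rms_norm(u, eps=1e-6):
    # Unscaled RMS normalization (assumption: learned scale handled separately).
    return u / np.sqrt(np.mean(u * u) + eps)

def gelu(u):
    # tanh approximation of GELU
    return 0.5 * u * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (u + 0.044715 * u**3)))

def block(v, W_gate, W_up, W_down, m):
    # Simplified gated MLP block with a residual connection (assumed layout).
    z = rms_norm(v)
    h = gelu(W_gate @ z) * (W_up @ z)
    return v + m * rms_norm(W_down @ h)

def rank1_patch(W, z_C, z):
    # Rank-1 update with (W + patch) @ z == W @ z_C.
    return np.outer(W @ (z_C - z), z) / (z @ z)

rng = np.random.default_rng(1)
d, d_ff = 6, 12  # hypothetical dimensions
W_gate = rng.normal(size=(d_ff, d))
W_up = rng.normal(size=(d_ff, d))
W_down = rng.normal(size=(d, d_ff))
m = rng.normal(size=d)

v_C = rng.normal(size=d)  # stand-in for A(C, x)
v = rng.normal(size=d)    # stand-in for A(C \ Y, x)
z_C, z = rms_norm(v_C), rms_norm(v)

# Inner patches: the patched block sees z but behaves as if it saw z_C.
W_gate_p = W_gate + rank1_patch(W_gate, z_C, z)
W_up_p = W_up + rank1_patch(W_up, z_C, z)

# Scale patch: absorbs the residual difference v_C - v.
# Element-wise division; assumes no entry of n is zero.
n = rms_norm(W_down @ (gelu(W_gate @ z_C) * (W_up @ z_C)))
delta_m = (v_C - v) / n

out_with_context = block(v_C, W_gate, W_up, W_down, m)
out_patched = block(v, W_gate_p, W_up_p, W_down, m + delta_m)
print(np.allclose(out_with_context, out_patched))
```

In this sketch the element-wise division is what lets the scale patch correct the residual stream coordinate by coordinate, and it breaks down if any coordinate of the normalized MLP output is zero.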

Figures (5)

  • Figure 1: Gemma MLP block diagram. $\mathbf{m}$ is part of the second RMS normalization ($\text{RMSNorm}_2$) but stated separately to match the equations. $\otimes$ denotes elementwise multiplication of vectors here.
  • Figure 2: Multi-layer equivalence diagram. The left column shows the model with updated parameters and no explicit context. The right column shows the original model with full context. At each layer $i$, we have $\mathbf{x}'_{i+1} = T'_i(C\setminus Y,\mathbf{x}'_{i})=T_i(C,\mathbf{x}_{i})=\mathbf{x}_{i+1}$. The deltas are now $\Delta A_{\mathbf{x}_i}(Y)=\textcolor{blue}{A_i(C,\mathbf{x}_i)} - \textcolor{red}{A_i(C\setminus Y, \mathbf{x}_i)}$ and the equivalent normed version. Note that the $\mathbf{x}'_i$ are different from the intermediate values when simply running a forward pass with the original parameters without context.
  • Figure 3: Comparison of generation metrics between the original and updated models. The top plot shows the $L_\infty$ norm of the logit difference and the bottom plot shows the Total Variation Distance, each at every step of the token generation process. The x-axis displays the sequence of generated tokens. Results are shown separately for each platform/data type. A red 'X' marks a step where the predicted tokens did not match.
  • Figure 4: Comparison of update accuracy for different data types and platforms. Here we show the distribution of the logit difference and the accuracy percentage over many textual generations.
  • Figure 5: Comparison of generation metrics between the original and updated models on images. This is the same matching experiment as above, run on Gemma 3 4B with an image as part of the context. The method continues to work with multi-modal input.

Theorems & Definitions (13)

  • Theorem 1: Single Block Equivalence
  • Theorem 2: Multi-Layer Equivalence
  • Definition 3: Input Controllability
  • Definition 4: Output Controllability
  • Theorem 5: Unified Theorem for Residual Blocks
  • Lemma 6: Input Controllability of MLPs
  • Lemma 7: Input Controllability of Pre-Norm MLPs
  • Lemma 8: Output Controllability of Outer Bias
  • Lemma 9: Output Controllability of Outer Weight Matrix
  • Lemma 10: Output Controllability of Outer Element-wise Multiply
  • ...and 3 more