Equivalence of Context and Parameter Updates in Modern Transformer Blocks
Adrian Goldwaser, Michael Munn, Javier Gonzalvo, Benoit Dherin
TL;DR
The paper studies how in-context learning arises in modern transformers, proving that the effect of a context can be absorbed into implicit patches to the MLP weights and normalization parameters. It gives a constructive proof for a Gemma-style block, extends the result to multi-layer architectures, and offers a practical layer-by-layer algorithm for computing the required updates. A general framework based on input controllability and output controllability unifies these results and extends them to a broad class of architectures, including gating, RMSNorm, and MoE variants; empirical validation on Gemma 3 shows near-perfect logit matching in many settings. The work offers a principled lens for understanding prompt-induced reconfiguration and informs architectural design for robust in-context learning, while noting that the updates are token-dependent, which precludes a single global inference-time patch.
Abstract
Recent research has established that the impact of context in a vanilla transformer can be represented implicitly by forming a token-dependent, rank-1 patch to its MLP weights. This work extends that foundational theory to the diverse architectures of modern Large Language Models. We first demonstrate a precise, analytical solution for a Gemma-style transformer block, proving that the entire effect of a context can be perfectly mapped to rank-1 patches on its MLP weight matrices and a patch to the RMSNorm scale. We then generalize this result, providing a constructive proof and algorithm for multi-layer models. To unify these findings, we introduce a general framework centered on two core properties: input controllability and output controllability. We prove that a perfect implicit weight patch is possible for any MLP block where the inner function is input-controllable and the outer function is output-controllable. This provides a simpler and more powerful lens for understanding how transformer models transmute prompts into effective weights. This setup generalizes to a wide range of modern LLM architectures including gating, pre-/post-norm, mixture of experts and sequential/parallel transformer blocks.
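The rank-1 patch mentioned in the first sentence of the abstract can be illustrated numerically. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the attention outputs with and without context are replaced by random stand-in vectors, residual connections and normalization are omitted, and the names `mlp_inner`, `rank1_patch`, `a_x`, and `a_cx` are hypothetical. It relies only on the identity (W + ΔW) a = W a' when ΔW = W (a' - a) aᵀ / ‖a‖², which holds for any vectors a ≠ 0 and a'.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16

# First MLP weight matrix and bias (random stand-ins, not trained weights).
W = rng.normal(size=(d_hidden, d_model))
b = rng.normal(size=d_hidden)

def mlp_inner(weights, a):
    """Inner MLP layer: ReLU(weights @ a + b)."""
    return np.maximum(weights @ a + b, 0.0)

def rank1_patch(weights, a_with_ctx, a_without_ctx):
    """Rank-1 update dW such that (weights + dW) @ a_without_ctx == weights @ a_with_ctx."""
    delta_a = a_with_ctx - a_without_ctx
    return np.outer(weights @ delta_a, a_without_ctx) / (a_without_ctx @ a_without_ctx)

# Assumption: these vectors play the role of the attention output for a query
# token without context (a_x) and with context (a_cx).
a_x = rng.normal(size=d_model)
a_cx = rng.normal(size=d_model)

delta_W = rank1_patch(W, a_cx, a_x)
out_with_context = mlp_inner(W, a_cx)         # original weights, context present
out_patched = mlp_inner(W + delta_W, a_x)     # patched weights, context removed

patch_is_exact = np.allclose(out_with_context, out_patched)
patch_rank = np.linalg.matrix_rank(delta_W)

# Token dependence: reuse the same patch for a different query token.
a_x2 = rng.normal(size=d_model)
a_cx2 = rng.normal(size=d_model)
reuse_matches = np.allclose(mlp_inner(W + delta_W, a_x2), mlp_inner(W, a_cx2))

print("patch reproduces context output:", patch_is_exact)
print("rank of patch:", patch_rank)
print("same patch works for another token:", reuse_matches)
```

The last check reflects the token dependence noted in the TL;DR: the patch is built from the contextless activation of one specific query token, so a patch computed for one token does not in general reproduce the context effect for another.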
