Transmuting prompts into weights

Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo

TL;DR

The paper addresses how textual prompts influence large language models through their internal computations and develops a theory that maps prompt content to token-dependent weight updates in transformers. It introduces thought patches, comprising a thought vector $\delta(I)$ and a thought matrix $\Delta(I)$, to condense transient token-level effects into durable, input-independent edits, unifying activation steering and matrix editing. The key contributions include (i) extending a single-block result to multi-block transformers, (ii) deriving practical least-squares formulations for $\delta(I)$ and $\Delta(I)$, and (iii) demonstrating empirical viability on arithmetic and translation tasks using a backpropagation-free weight-editing approach. This framework provides a principled explanation for common heuristics (contrastive averaging for steering vectors and low-rank edits) and offers a path toward reliable, theory-grounded model control without relying on the full prompt at inference.

Abstract

A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt's influence can be mathematically mapped to implicit weight updates (Dherin et al., 2025), we generalize this theory to deep, multi-block transformers. We show how the information contained in any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices. We then derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.

Paper Structure

This paper contains 26 sections, 8 theorems, 59 equations, 2 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

Consider $n$ vectors $a_1, \dots, a_n$ in $\mathbb R^d$ with which we form the operators $\Delta_i = \frac{\delta_i a_i^T}{\|a_i\|^2}$ where the $\delta_i\in \mathbb R^d$ are fixed vectors. Then the following minimization problem over the space of $d\times d$ matrices has a unique solution if and only if the operator $Z = \sum_{i=1}^n a_i a_i^T$ is invertible. In this case the minimum is reached
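The statement above breaks off before giving the objective and the minimizer. As a minimal sketch, assume the intended problem is the least-squares fit $\min_M \sum_{i=1}^n \|M a_i - \Delta_i a_i\|^2$. This is an assumption, chosen because it is the natural objective whose uniqueness hinges on $Z$ being invertible; the excerpt does not confirm it. Since $\Delta_i a_i = \delta_i$, setting the gradient to zero gives $M Z = \sum_i \delta_i a_i^T$, which has a unique solution exactly when $Z$ is invertible. The helper name `thought_matrix` and the random data below are hypothetical and serve only to check this reading numerically.

```python
import numpy as np


def thought_matrix(A, D):
    """Least-squares aggregate of per-token rank-one updates (assumed objective).

    A: (n, d) array whose rows are the vectors a_i.
    D: (n, d) array whose rows are the vectors delta_i.
    Returns the d x d matrix M minimizing sum_i ||M a_i - delta_i||^2.
    """
    Z = A.T @ A  # Z = sum_i a_i a_i^T
    if np.linalg.matrix_rank(Z) < Z.shape[0]:
        raise ValueError("Z is singular: the minimizer is not unique")
    S = D.T @ A  # S = sum_i delta_i a_i^T
    # Normal equations: M Z = S. Z is symmetric, so solve Z M^T = S^T.
    return np.linalg.solve(Z, S.T).T


def objective(M, A, D):
    """Sum of squared residuals sum_i ||M a_i - delta_i||^2."""
    return float(np.sum((A @ M.T - D) ** 2))


rng = np.random.default_rng(0)
n, d = 12, 4
A = rng.normal(size=(n, d))
D = rng.normal(size=(n, d))

# Each per-token operator Delta_i = delta_i a_i^T / ||a_i||^2 maps a_i to delta_i.
Delta_0 = np.outer(D[0], A[0]) / (A[0] @ A[0])

M = thought_matrix(A, D)
print("residual at M:", objective(M, A, D))
```

With more vectors than dimensions ($n > d$) the single matrix $M$ generally cannot reproduce every $\delta_i$ exactly, so the residual is nonzero; it is the best token-independent compromise under the assumed objective.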

Figures (2)

  • Figure 1: Summation on the left, and multiplication on the right. Accuracy (left Y-axis) given a step for training on the product dataset.
  • Figure 2: Accuracy when applying $\Delta W$s and $\delta b$ during the various steps. Vanilla Gemma with instructions achieves 0.72 accuracy based on the same Gemini model evaluator.

Theorems & Definitions (17)

  • Theorem 3.1
  • Remark 3.2
  • Theorem A.1
  • proof
  • Lemma B.1
  • proof
  • Lemma B.2
  • proof
  • Lemma B.3
  • proof
  • ...and 7 more