Weight Updates as Activation Shifts: A Principled Framework for Steering

Dyah Adila, John Cooper, Alexander Yun, Avi Trost, Frederic Sala

TL;DR

The paper establishes a first-order equivalence between activation-space interventions and weight-space updates, derives the conditions under which activation steering can replicate fine-tuning behavior, and identifies the post-block output as a theoretically backed and highly expressive intervention site.

Abstract

Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices -- such as intervention location and parameterization -- that currently rely on empirical heuristics rather than a principled foundation. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically-backed and highly expressive intervention site. We further explain why certain intervention locations outperform others and show that weight updates and activation updates play distinct, complementary functional roles. This analysis motivates a new approach -- joint adaptation -- that trains in both spaces simultaneously. Our post-block steering achieves accuracy within 0.2%-0.9% of full-parameter tuning, on average across tasks and models, while training only 0.04% of model parameters. It consistently outperforms prior activation steering methods such as ReFT and PEFT approaches including LoRA, while using significantly fewer parameters. Finally, we show that joint adaptation often surpasses the performance ceilings of weight and activation updates in isolation, introducing a new paradigm for efficient model adaptation.
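The first-order equivalence described above can be illustrated numerically. The following is a minimal sketch, not the paper's construction: it assumes a toy residual block `h + W2 tanh(W1 h)` (the function names, the `tanh` nonlinearity, and the dimensions are all hypothetical choices made for illustration) and checks that a small weight update `dW1` changes the block output by approximately the same amount as an input-dependent shift added after the block. The shift is the first-order Taylor term `W2 diag(1 - tanh^2(W1 h)) dW1 h`, so the remaining gap should shrink roughly quadratically as the update gets smaller.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def block(h, W1, W2):
    """Toy residual block (an assumption for illustration): h + W2 @ tanh(W1 @ h)."""
    hidden = [math.tanh(z) for z in matvec(W1, h)]
    return [hi + oi for hi, oi in zip(h, matvec(W2, hidden))]


def first_order_shift(h, W1, W2, dW1):
    """Shift at the block output that matches the weight update dW1 to first order."""
    pre = matvec(W1, h)    # pre-activation under the original weights
    dpre = matvec(dW1, h)  # change in pre-activation caused by dW1
    dhidden = [(1.0 - math.tanh(z) ** 2) * dz for z, dz in zip(pre, dpre)]
    return matvec(W2, dhidden)


def gap(eps, seed=0):
    """Largest coordinate-wise difference between weight tuning and post-block steering."""
    rng = random.Random(seed)
    d, m = 4, 8  # hypothetical model and hidden dimensions

    def rand(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

    W1, W2, direction = rand(m, d), rand(d, m), rand(m, d)
    h = [rng.uniform(-1, 1) for _ in range(d)]
    dW1 = [[eps * x for x in row] for row in direction]
    W1_new = [[w + dw for w, dw in zip(rw, rdw)] for rw, rdw in zip(W1, dW1)]

    tuned = block(h, W1_new, W2)  # weight-space update
    shift = first_order_shift(h, W1, W2, dW1)
    steered = [b + s for b, s in zip(block(h, W1, W2), shift)]  # activation-space update
    return max(abs(t - s) for t, s in zip(tuned, steered))


if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3):
        print(f"eps={eps:g}  gap={gap(eps):.3e}")
```

Running the script prints the gap for three update sizes. Only the trend matters here: each tenfold reduction in `eps` should reduce the gap by much more than a factor of ten, which is the behavior expected of a first-order approximation.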

Paper Structure (52 sections, 13 theorems, 88 equations, 6 figures, 13 tables)

Key Result

Theorem 3.1

Let $V$ and $V'$ be the right-singular matrices of $h+\mathrm{Attn}(h) + \mathrm{GLU}(h + \mathrm{Attn}(h))$ and $A_p\mathrm{GLU}(h + \mathrm{Attn}(h))$ respectively, and let $X,Y$ be such that $Y_i = h_i + \mathrm{Attn}(h_i)$ and $X_i = \mathrm{GLU}(Y_i)$. Then, the optimal relative error satisfies where $\sigma_i$ is the $i$-th singular value of $A_p X$ and $\theta_i$ is the $i$-th principal ang

Figures (6)

  • Figure 1: Left: Weight tuning scales with MLP dimension, while activation steering scales only with model dimension. Right: pre-MLP vs. post-MLP vs. post-block (ours). We steer after the skip connection is added back to the MLP output, accounting for both pathways.
  • Figure 2: Average MLP output norm with respect to the layer's block output for Llama-3.2B on Winogrande. Post-MLP steering covers $\leq70\%$ of the change from finetuning.
  • Figure 3: Joint training when done naively. The dot products between the top right singular vectors of the initial updates for both weight and activation adapters. Weight and activation adapters learn to fit the same subspace early on, represented by the diagonal.
  • Figure 4: Effect of rank and linearity across model sizes
  • Figure 5: Distribution of cosine similarity between weight and adapter shifts. Closer to zero indicates more orthogonal solutions.
  • ...and 1 more figure
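The scaling claim in the Figure 1 caption can be made concrete with a rough parameter count. This is a sketch under stated assumptions, not the paper's accounting: it assumes a bias-free GLU MLP with three projection matrices, a LoRA adapter of rank `rank` on each of those projections, and a hypothetical low-rank steering adapter of the form `h + B(A h)` applied to the block output. The dimensions used below are illustrative values only, and the paper's actual steering parameterization may differ.

```python
def glu_mlp_params(d_model, d_mlp):
    """Weights in a bias-free GLU MLP: gate, up, and down projections."""
    return 3 * d_model * d_mlp


def lora_mlp_params(d_model, d_mlp, rank):
    """Rank-`rank` LoRA on each of the three MLP projections (assumed setup)."""
    return 3 * rank * (d_model + d_mlp)


def low_rank_steering_params(d_model, rank):
    """Hypothetical low-rank adapter on the block output: A is rank x d_model, B is d_model x rank."""
    return 2 * d_model * rank


if __name__ == "__main__":
    d_model, d_mlp, rank = 4096, 14336, 8  # illustrative values, not taken from the paper
    print("full MLP weights per block:", glu_mlp_params(d_model, d_mlp))
    print("LoRA on MLP per block:     ", lora_mlp_params(d_model, d_mlp, rank))
    print("steering adapter per block:", low_rank_steering_params(d_model, rank))
```

Under these assumptions, both weight-space counts grow with `d_mlp`, while the steering count depends only on `d_model` and `rank`, which is the contrast the left panel of Figure 1 describes.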

Theorems & Definitions (25)

  • Theorem 3.1
  • proof
  • Proposition 4.1
  • Proposition 4.2
  • proof : Proof (sketch)
  • Proposition C.1
  • proof
  • Theorem C.2
  • proof
  • Proposition C.3
  • ...and 15 more