Steer2Edit: From Activation Steering to Component-Level Editing

Chung-En Sun, Ge Yan, Zimo Wang, Tsui-Wei Weng

TL;DR

This work addresses the problem of controlling large language model behavior without retraining. It argues that activation-space steering is limited because it applies a single global modification to the model's internal states. It proposes Steer2Edit, a training-free framework that translates steering vectors into component-level rank-1 weight edits, distributing influence across individual attention heads and MLP neurons while preserving the standard forward pass. For each component, the method determines three quantities (an output-space direction $u_i$, an input-space direction $k_i$, and an edit magnitude $\lambda_i$) under semantic invariance and an Elastic-Net budget, yielding a closed-form editing rule that is both interpretable and deployment-friendly. Across safety, truthfulness, and reasoning efficiency, Steer2Edit achieves more favorable attribute-utility trade-offs than activation steering, with sparse, layer-local attention edits for safety and truthfulness and distributed MLP edits for efficiency. The result is a practical route to controlled, auditable model behavior without fine-tuning.

Abstract

Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
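The contrast drawn in the abstract between a fixed, global activation shift and a rank-1 weight edit can be illustrated with a small NumPy sketch. This is not the authors' implementation: the weight matrix `W`, the dimensions, the input direction `k`, and the magnitudes `lam` and `alpha` are all hypothetical placeholders chosen for illustration. The only structure taken from the text is the edit form $\Delta W_i = \lambda_i u_i k_i^{\top}$ with $u_i$ aligned to the steering vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 6, 4                      # hypothetical toy dimensions

W = rng.normal(size=(d_out, d_in))      # hypothetical weight of one component
v = rng.normal(size=d_out)              # stand-in for a steering vector
u = v / np.linalg.norm(v)               # output direction aligned with v
k = rng.normal(size=d_in)
k /= np.linalg.norm(k)                  # hypothetical unit input direction
lam = 0.5                               # hypothetical edit magnitude


def steer(h, alpha=0.5):
    """Activation steering: the same shift alpha * u is added for every input."""
    return W @ h + alpha * u


# Rank-1 edit folded into the weights; the forward pass stays a plain matmul.
W_edit = W + lam * np.outer(u, k)


def edited(h):
    """Edited component: no inference-time hook, just the updated weights."""
    return W_edit @ h


h_on = k.copy()                         # input fully aligned with k
h_off = rng.normal(size=d_in)
h_off -= (h_off @ k) * k                # input made orthogonal to k

shift_on = edited(h_on) - W @ h_on      # equals lam * (k . h_on) * u = lam * u
shift_off = edited(h_off) - W @ h_off   # vanishes because k . h_off = 0
steer_shift = steer(h_off) - W @ h_off  # alpha * u regardless of the input
```

In this toy setup the edit moves the output along `u` only in proportion to `k @ h`, while the steering shift is identical for every input. That input dependence is the design point: the edit can stay silent on inputs it is not meant to affect.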

Paper Structure (76 sections, 6 theorems, 53 equations, 16 figures, 6 tables)

Key Result

Theorem 3.1

Let $v_i \neq 0$, and let $\Delta W_i = \lambda_i\, u_i k_i^{\top}$ be a rank-1 edit with $\Delta W_i \neq 0$. If for all $h_i$ and all $z \perp v_i$ we have

$$z^{\top} \Delta W_i\, h_i = 0,$$

then the output-space direction $u_i$ must be collinear with $v_i$, i.e.,

$$u_i = c\, v_i \quad \text{for some scalar } c \neq 0.$$
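A short derivation sketch may help show why collinearity follows. It assumes the invariance condition says that the edit produces no output along any direction $z$ orthogonal to $v_i$, written here as $z^{\top} \Delta W_i h_i = 0$. That form is an assumption made for illustration, and the paper's own proof may proceed differently.

```latex
\begin{align*}
z^{\top} \Delta W_i h_i
  &= \lambda_i \,(z^{\top} u_i)\,(k_i^{\top} h_i) = 0
  && \text{for all } h_i \text{ and all } z \perp v_i, \\
\Delta W_i \neq 0
  &\;\Rightarrow\; \lambda_i \neq 0,\ u_i \neq 0,\ k_i \neq 0
  \;\Rightarrow\; \exists\, h_i :\ k_i^{\top} h_i \neq 0, \\
&\;\Rightarrow\; z^{\top} u_i = 0 \ \text{ for all } z \perp v_i
  \;\Rightarrow\; u_i \in \operatorname{span}(v_i).
\end{align*}
```

Since $u_i \neq 0$, membership in $\operatorname{span}(v_i)$ means $u_i = c\, v_i$ for some scalar $c \neq 0$, which is the collinearity claim.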

Figures (16)

  • Figure 1: Overview of Steer2Edit. Steer2Edit converts the steering signal into component-level rank-1 weight edits. For each component, the edit $\Delta W_i = \lambda_i u_i k_i^\top$ is constructed by aligning the output direction $u_i$ with the steering vector, choosing an input direction $k_i$ that triggers the edit only on relevant inputs, and allocating the magnitude $\lambda_i$ under a global budget. The resulting edits are training-free, architecture-preserving, and interpretable.
  • Figure 2: Safety--utility trade-off on LLaMA-2-7B-Chat and Mistral-7B-Instruct-v0.2. Each point corresponds to a different intervention strength. Steer2Edit consistently attains higher refusal rates at comparable or higher utility, while strong steering-vector interventions incur substantial utility degradation.
  • Figure 3: Signed Steer2Edit edit coefficients $\lambda$ for safety alignment. Positive (red) coefficients reinforce safety-aligned components, while negative (blue) coefficients suppress safety-opposing ones. Edits are highly sparse and concentrated in a small subset of attention heads, predominantly in later layers.
  • Figure 4: Truthfulness--utility trade-off on Gemma-2-2B-IT and LLaMA-3-8B-Instruct. Steer2Edit improves truthfulness at a higher downstream utility than activation steering.
  • Figure 5: Signed Steer2Edit edit coefficients $\lambda$ for truthfulness promotion. Positive values reinforce truthfulness-aligned components, while negative values suppress components associated with hallucinated behavior.
  • ...and 11 more figures
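The sparse, signed coefficients described in Figures 3 and 5 are the kind of pattern an Elastic-Net penalty tends to produce. The sketch below is a generic illustration, not the paper's allocation rule. It assumes a separable objective $\max_{\lambda} \sum_i \lambda_i s_i - \alpha \lVert \lambda \rVert_1 - \tfrac{\beta}{2} \lVert \lambda \rVert_2^2$, where the per-component scores `scores` and the penalty weights `l1` and `l2` are hypothetical. Under that assumption the maximizer is the standard soft-thresholding formula.

```python
import numpy as np


def allocate_magnitudes(scores, l1=0.5, l2=2.0):
    """Closed-form maximizer of sum(lam*s) - l1*|lam|_1 - (l2/2)*|lam|_2^2.

    Assumed objective for illustration only; `scores`, `l1`, and `l2`
    are hypothetical and not taken from the paper.
    """
    scores = np.asarray(scores, dtype=float)
    # Soft-threshold each score by l1, then rescale by the ridge weight l2.
    return np.sign(scores) * np.maximum(np.abs(scores) - l1, 0.0) / l2


def objective(lam, scores, l1=0.5, l2=2.0):
    """Value of the assumed Elastic-Net objective at a candidate allocation."""
    lam = np.asarray(lam, dtype=float)
    return lam @ scores - l1 * np.abs(lam).sum() - 0.5 * l2 * (lam ** 2).sum()


# Hypothetical alignment scores for five components.
scores = np.array([1.5, 0.2, -0.9, 0.4, -0.1])
lam = allocate_magnitudes(scores)
```

With these toy scores, only the first and third components receive non-zero magnitudes (about 0.5 and -0.2). Components whose absolute score falls below `l1` are left unedited, and the sign of each coefficient follows the sign of its score, so the allocation can both reinforce and suppress.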

Theorems & Definitions (9)

  • Theorem 3.1: Output-space direction under semantic invariance
  • Theorem 3.2: Input-space direction matching semantic alignment variation
  • Theorem 3.3: Edit magnitude allocation under regularization
  • Theorem 1.1: Output-space direction under semantic invariance
  • proof
  • Theorem 1.2: Input-space direction matching semantic alignment variation
  • proof
  • Theorem 1.3: Edit magnitude allocation under regularization
  • proof