Steering Language Models with Weight Arithmetic

Constanza Fierro, Fabien Roger

TL;DR

This work introduces contrastive weight steering, a post-training method that edits LLM weights along a weight-space direction derived from two contrasting fine-tunes on narrow behavior data. Subtracting the weight delta of a fine-tune that induces the opposite behavior from that of a fine-tune that induces the desired behavior yields a weight steering vector. This vector often generalizes further out of distribution than activation steering, giving stronger behavioral control before general capabilities degrade. The method is shown to mitigate sycophancy, to reduce under-refusals introduced by task-specific fine-tuning, and to make better use of narrow training data, with preliminary evidence that weight directions can also help detect emergent misalignment during training. Together, these findings suggest a practical, data-efficient approach to steering and monitoring alignment in large language models. The approach offers a flexible alternative to prompting and activation steering, with potential applications in safety, ethics, and model governance.

Abstract

Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
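The weight arithmetic described in the abstract can be sketched on toy data. This is a minimal illustration, not the authors' implementation: the models are stand-in dictionaries of flat float lists, and the function names (`weight_delta`, `steering_vector`, `apply_steering`, `cosine_similarity`), the parameter name `layer.weight`, and the scaling factor `alpha` are all assumptions introduced here. Using cosine similarity for the monitoring step is also an assumption; the abstract only says "similarity".

```python
import math

def weight_delta(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}

def steering_vector(base, positive, negative):
    """Behavior direction: (positive - base) - (negative - base)."""
    d_pos = weight_delta(positive, base)
    d_neg = weight_delta(negative, base)
    return {name: [p - n for p, n in zip(d_pos[name], d_neg[name])]
            for name in base}

def apply_steering(target, w_b, alpha):
    """Add the direction scaled by alpha; a negative alpha removes it."""
    return {name: [t + alpha * w for t, w in zip(target[name], w_b[name])]
            for name in target}

def cosine_similarity(update, direction):
    """Cosine similarity between two weight-space vectors (assumed metric)."""
    dot = sum(u * d for name in direction
              for u, d in zip(update[name], direction[name]))
    norm_u = math.sqrt(sum(u * u for name in update for u in update[name]))
    norm_d = math.sqrt(sum(d * d for name in direction for d in direction[name]))
    if norm_u == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_u * norm_d)

# Toy stand-ins for a base model and two small fine-tunes (hypothetical values).
base = {"layer.weight": [1.0, 2.0, 3.0]}
positive = {"layer.weight": [1.5, 2.0, 3.0]}   # induces the behavior
negative = {"layer.weight": [0.5, 2.0, 3.0]}   # induces its opposite

w_b = steering_vector(base, positive, negative)
steered = apply_steering(base, w_b, alpha=2.0)

# Monitoring sketch: compare a fine-tuning update against the direction.
sim = cosine_similarity(weight_delta(positive, base), w_b)
```

Because the base weights cancel in the subtraction, the direction reduces to the difference between the two fine-tuned models. With real checkpoints the same arithmetic would be applied tensor by tensor.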

Paper Structure

This paper contains 62 sections, 1 equation, 35 figures, 6 tables.

Figures (35)

  • Figure 1: Comparison of activation steering and contrastive weight steering (ours). Both derive a steering vector ($a_b$, $w_b$) from the contrast between a narrow distribution of positive and negative question-answer pairs (exhibiting a behavior and its opposite). Activation steering uses differences in activations and edits inference by adding $a_b$ to the intermediate hidden state. Weight steering uses the difference between fine-tuned weights, editing the model by adding $w_b$ to the weights of the target model (either the original model or the model after a task-specific fine-tuning). We compare this to the baseline of adding the positive examples as extra data to task-specific fine-tuning.
  • Figure 2: Sycophancy modification of Qwen2.5-7B-Instruct tested with weight/activation steering (darker=larger scaling factor) and fine-tuning (darker=later checkpoint). Sycophancy is evaluated by appending cues (e.g., "I think the answer is") to factual questions that the model answers correctly without the cue, and measuring whether the answer remains correct. Weight steering is more effective at controlling sycophancy than activation steering both when steering towards sycophancy (left) and away from sycophancy (right).
  • Figure 3: Weight steering reduces sycophancy while preserving GCD performance, both in terms of style (disagreement) and mathematical content (correctness). Qwen2.5-1.5B-Instruct is fine-tuned on GCD queries with correct user-proposed solutions, which increases sycophancy, and is evaluated on queries where the user-proposed solution is incorrect. Weight and activation steering are evaluated across scalar coefficients (darker = larger magnitude). The joint baseline adds non-sycophantic data during training (darker = more data).
  • Figure 4: Random example of generations with 4 different models in the incorrect-solution split.
  • Figure 5: (left) Evilness steering of Qwen2.5-7B-Instruct with multiple scaling factors (darker = larger) and fine-tuning (darker = later checkpoint). The evil evaluation contains cheating vs. honesty scenarios presented as two-choice options. Weight steering steers towards higher levels of evilness while maintaining general capabilities. (right) Consistency evaluation between the reasoning and the final answer: activation steering increases CoT inconsistencies more than weight steering does (hatched area).
  • ...and 30 more figures
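
The contrast drawn in the Figure 1 caption can be illustrated on a single toy linear layer. This is a sketch under stated assumptions, not the paper's setup: the layer, the inputs, the vector `a_b`, the matrix `W_b`, and the helper names are all hypothetical. It shows only the mechanical difference between the two edits: adding $a_b$ to the hidden state shifts every input's activation by the same amount, while adding $w_b$ to the weights changes the activation by an amount that depends on the input.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def activation_steer(W, x, a_b, alpha):
    """Run the layer, then add the scaled activation vector to its output."""
    h = matvec(W, x)
    return [hi + alpha * ai for hi, ai in zip(h, a_b)]

def weight_steer(W, W_b, alpha):
    """Return edited weights W + alpha * W_b; inference itself is unchanged."""
    return [[w + alpha * d for w, d in zip(row, d_row)]
            for row, d_row in zip(W, W_b)]

# Hypothetical 2x2 layer, steering vectors, and two inputs.
W = [[1.0, 0.0], [0.0, 1.0]]
a_b = [1.0, 0.0]
W_b = [[0.0, 1.0], [0.0, 0.0]]
x1 = [1.0, 2.0]
x2 = [1.0, -2.0]

# Activation steering: the same shift [1, 0] is applied for both inputs.
act1 = activation_steer(W, x1, a_b, alpha=1.0)
act2 = activation_steer(W, x2, a_b, alpha=1.0)

# Weight steering: the shift is W_b @ x, so it differs between inputs.
W_steered = weight_steer(W, W_b, alpha=1.0)
wgt1 = matvec(W_steered, x1)
wgt2 = matvec(W_steered, x2)
```

In this toy case the weight edit moves the first output coordinate up for `x1` and down for `x2`, whereas the activation edit moves it up by the same amount for both.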