Steering Language Models with Weight Arithmetic
Constanza Fierro, Fabien Roger
TL;DR
This work introduces contrastive weight steering, a post-training method that edits LLM weights along a direction in weight space. The direction is obtained by subtracting the weight deltas of two small fine-tunes on narrow behavior datasets, one that induces a behavior and one that induces its opposite. The resulting weight steering vector often generalizes further to out-of-distribution behaviors than activation steering before general capabilities degrade. The method is shown to mitigate sycophancy and, after task-specific fine-tuning, to reduce the sycophancy and under-refusals that fine-tuning introduced while preserving task performance gains. There is also preliminary evidence that emergent misalignment can be detected during training by measuring the similarity between fine-tuning updates and an "evil" weight direction. Together, these findings suggest a practical way to steer and monitor alignment in large language models using narrow training data. The approach offers a flexible alternative to prompting and activation steering, with potential applications in safety, ethics, and model governance.
Abstract
Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
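The weight arithmetic described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that model weights are represented as a plain dictionary mapping parameter names to flat lists of floats, and that steering strength is controlled by a scalar `alpha`. The function names, `alpha`, and the toy values are all hypothetical and chosen only for illustration.

```python
import math

def weight_delta(finetuned, base):
    """Per-parameter difference between fine-tuned and base weights."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}

def contrastive_direction(base, positive, negative):
    """Subtract the negative fine-tune's delta from the positive one's.

    Changes shared by both fine-tunes cancel, leaving the part that
    differs between the behavior and its opposite.
    """
    d_pos = weight_delta(positive, base)
    d_neg = weight_delta(negative, base)
    return {name: [p - n for p, n in zip(d_pos[name], d_neg[name])]
            for name in base}

def steer(weights, direction, alpha):
    """Add (alpha > 0) or remove (alpha < 0) the direction.

    alpha is an assumed scaling knob, not a value from the paper.
    """
    return {name: [w + alpha * d for w, d in zip(weights[name], direction[name])]
            for name in weights}

def cosine_similarity(update, direction):
    """Cosine similarity between two weight-space vectors, flattened."""
    u = [x for name in sorted(update) for x in update[name]]
    v = [x for name in sorted(direction) for x in direction[name]]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

# Toy weights: both fine-tunes shift the second entry by the same amount,
# but move the first entry in opposite directions.
base = {"layer.w": [0.0, 1.0, -1.0]}
positive = {"layer.w": [0.5, 1.25, -1.0]}
negative = {"layer.w": [-0.5, 1.25, -1.0]}

direction = contrastive_direction(base, positive, negative)
steered = steer(base, direction, alpha=0.5)

# Monitoring sketch: compare a later fine-tuning update to the direction.
other_finetune = {"layer.w": [2.0, 1.0, -1.0]}
similarity = cosine_similarity(weight_delta(other_finetune, base), direction)
```

In this toy case the shared shift in the second entry cancels, so the direction is nonzero only in the first entry. With real models the same arithmetic would be applied tensor by tensor, and a high similarity between a training update and a behavior direction would be a signal to inspect, not proof of misalignment.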
