Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank
TL;DR
This paper tackles false refusals in large language models with a training-free, model-agnostic method based on single vector ablation. It extracts a false refusal vector from pseudo-harmful prompts and surgically removes its influence via activation steering, after orthogonalizing it against the true refusal vector so that legitimate refusals are not harmed. A partial orthogonalization parameter $\lambda$ enables fine-grained safety calibration, and the approach reduces false refusals across multiple models with minimal impact on general performance. Additional experiments show that adding a true refusal vector can enhance safety in some cases but often degrades overall capability, which highlights the practical benefit of surgical ablation for safe, flexible deployment. The work contributes a cheap, flexible tool for post-training safety calibration applicable to current and future language models.
Abstract
Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours: models should refuse to follow malicious instructions or give harmful advice (e.g. "how do I kill someone?"), but they should not refuse safe requests, even if they superficially resemble unsafe ones (e.g. "how do I kill a Python process?"). Avoiding such false refusal, as prior work has shown, is challenging even for highly capable language models. In this paper, we propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. For a given model, we extract a false refusal vector and show that ablating this vector reduces the false refusal rate while preserving the model's safety and general capabilities. We also show that our approach can be used for fine-grained calibration of model safety. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
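
The pipeline summarized above (extract a false refusal vector, orthogonalize it against the true refusal vector with strength $\lambda$, then ablate it from the activations) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that both vectors are obtained as differences of mean activations between prompt sets, which the text above does not specify, and it uses synthetic activations in place of a real model's hidden states. All function and variable names (`diff_in_means`, `partial_orthogonalize`, `ablate`, `lam`) are hypothetical.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


def diff_in_means(acts_a, acts_b):
    """Candidate direction: mean activation of set A minus that of set B.

    Assumption: the summary above does not state how the vectors are
    extracted; difference-in-means is used here for illustration only.
    """
    return acts_a.mean(axis=0) - acts_b.mean(axis=0)


def partial_orthogonalize(v_false, v_true, lam=1.0):
    """Remove a fraction lam of v_false's component along v_true.

    lam=1.0 makes the result fully orthogonal to v_true;
    lam=0.0 leaves v_false unchanged.
    """
    r = unit(v_true)
    return v_false - lam * np.dot(v_false, r) * r


def ablate(acts, direction):
    """Project each activation row onto the complement of `direction`."""
    d = unit(direction)
    return acts - np.outer(acts @ d, d)


# Synthetic stand-ins for hidden states (rows are prompts, columns are features).
rng = np.random.default_rng(0)
dim = 16
e0, e1 = np.eye(dim)[0], np.eye(dim)[1]
harmless = rng.normal(size=(32, dim))
harmful = rng.normal(size=(32, dim)) + 2.0 * e0
pseudo_harmful = rng.normal(size=(32, dim)) + 1.0 * e0 + 2.0 * e1

v_true = diff_in_means(harmful, harmless)           # true refusal direction
v_false = diff_in_means(pseudo_harmful, harmless)   # false refusal direction
v_ablate = partial_orthogonalize(v_false, v_true, lam=1.0)
cleaned = ablate(pseudo_harmful, v_ablate)
```

With `lam=1.0` the ablated direction has no overlap with the true refusal direction, so the component of each activation along `v_true` is left untouched. Smaller values of `lam` retain part of that overlap, which is one way to read the fine-grained calibration described above. In a real model, `ablate` would be applied to hidden states during the forward pass rather than to a fixed array.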
