Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank

TL;DR

This paper tackles false refusals in large language models by introducing a training-free, model-agnostic method based on single vector ablation. It extracts a false refusal vector from pseudo-harmful prompts and surgically removes its influence via activation steering, orthogonalizing it against the true refusal vector so that legitimate refusals are not harmed. A partial orthogonalization parameter $\lambda$ enables fine-grained safety calibration, and the approach reduces false refusals across multiple models with minimal impact on general performance. Additional experiments show that while adding a true refusal vector can improve safety in some cases, it often degrades general capability, which highlights the practical benefit of the proposed surgical ablation for safe, flexible deployment. The work contributes a cheap, flexible tool for post-training safety calibration applicable to current and future language models.

Abstract

Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours: Models should refuse to follow malicious instructions or give harmful advice (e.g. "how do I kill someone?"), but they should not refuse safe requests, even if they superficially resemble unsafe ones (e.g. "how do I kill a Python process?"). Avoiding such false refusal, as prior work has shown, is challenging even for highly capable language models. In this paper, we propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. For a given model, we extract a false refusal vector and show that ablating this vector reduces the false refusal rate while preserving the model's safety and general capabilities. We also show that our approach can be used for fine-grained calibration of model safety. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
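
To make the idea of single vector ablation with partial orthogonalization concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the two refusal vectors are already available as plain arrays, that partial orthogonalization subtracts a $\lambda$-scaled projection onto the true refusal direction, and that ablation projects the resulting direction out of each activation. The function names (`partially_orthogonalize`, `ablate`) and the random toy data are hypothetical and stand in for real model activations.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


def partially_orthogonalize(false_vec, true_vec, lam=1.0):
    """Remove a lam-scaled share of the true-refusal component from false_vec.

    Assumed form: lam = 1.0 gives a vector fully orthogonal to true_vec,
    lam = 0.0 leaves false_vec unchanged.
    """
    t_hat = unit(true_vec)
    return false_vec - lam * np.dot(false_vec, t_hat) * t_hat


def ablate(activations, direction):
    """Project `direction` out of every row of `activations`.

    activations: array of shape (n_tokens, d_model)
    direction:   array of shape (d_model,)
    """
    d_hat = unit(direction)
    return activations - np.outer(activations @ d_hat, d_hat)


# Toy stand-ins for vectors and activations (random, purely illustrative).
rng = np.random.default_rng(0)
true_vec = rng.normal(size=8)
false_vec = rng.normal(size=8)
acts = rng.normal(size=(4, 8))

v = partially_orthogonalize(false_vec, true_vec, lam=1.0)
steered = ablate(acts, v)
```

Under these assumptions, with `lam=1.0` the ablated direction has no overlap with the true refusal direction, so the component of each activation along `true_vec` is left untouched. Smaller values of `lam` keep more of that overlap in the ablated vector, which matches the summary's description of $\lambda$ as a knob for calibrating how open the model becomes.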

Paper Structure

This paper contains 33 sections, 10 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Response examples of Llama2-7b-Chat on harmful and pseudo-harmful queries. Our method removes false refusal while keeping true refusal.
  • Figure 2: Responses of Llama2-7b-Chat to an XSTest-Safe sample under different $\lambda$ values. The openness of the response increases as $\lambda$ is lowered: the lower $\lambda$ is, the less sensitive the model is when answering the question. The sensitivity level can be adjusted by the user.
  • Figure 3: MMLU accuracy and compliance rate (CR) to pseudo-harmful (OR, XSTest) and harmful data. Changing the $\lambda$ value can adjust the sensitivity to safety-related questions. Lowering $\lambda$ can make the model less sensitive and more open to answering questions. The model's general capability is unaffected since we adopt a surgical approach by only selecting vectors that have minimal effect on the output distribution.
  • Figure 4: Refusal score changes when ablating the true (row 1) and false (rows 2 and 3) refusal vectors extracted at certain layers and token positions. As $\lambda$ increases, the refusal vectors have less impact on the model's refusal behaviour.
  • Figure 5: Performance of Llama2-7b-Chat under different combinations of $\alpha$ and $\lambda$. The highlighted area is where the modified model behaves relatively safely on JailbreakBench compared to the original model (baseline). Vector addition improves model safety at the cost of performance on ARC-C, MMLU and Wikitext.
  • ...and 8 more figures