Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Zhiyu Xue, Zimo Qi, Guangliang Liu, Bocheng Chen, Ramtin Pedarsani

TL;DR

This paper examines how overrefusal arises under safety alignment, and proposes a mitigation strategy inspired by the findings that achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods.

Abstract

Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem, in which aligned LLMs also reject benign queries after safety alignment post-training, remains insufficiently studied. This issue degrades the usability of safety alignment in real-world applications. In this paper, we examine how overrefusal arises under safety alignment and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses. Safety alignment encourages LLMs to associate the refusal triggers within a training sample with refusal responses, leading aligned LLMs to refuse harmful queries. However, refusal triggers include not only harmful linguistic cues but also non-harmful ones, which causes overrefusal of benign queries. Building on this mechanistic analysis, we propose a method that explicitly accounts for refusal triggers during safety alignment fine-tuning. Empirical results demonstrate that our approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods. Warning: this paper contains harmful and biased sentences.

Paper Structure (14 sections, 4 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: How safety alignment can induce overrefusal. Top: During training, harmful intent is aligned with refusal, but harmless events (e.g., create a video) and generic help-seeking wording (e.g., Can you help me) can also become associated with refusal. Bottom: At test time, benign queries containing these cues may be rejected.
  • Figure 2: Examples of rephrased refusal triggers across Levels 1-3.
  • Figure 3: Similarity scores in the hidden state space between refusal triggers and benign test queries. For each benign test query, we retrieve the top-k most similar refusal triggers and compute the mean similarity scores separately for rejected and accepted queries. Rejected test queries are more similar to the extracted refusal triggers than accepted queries are.
  • Figure 4: Overview of the proposed method. Refusal triggers are first extracted from the harmful training dataset $\mathcal{D}_h$. These triggers are then repurposed to generate benign training samples $\mathcal{D}_b$ that match the trigger distribution, enabling the model to differentiate between harmful intent and benign queries containing refusal triggers.
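
The comparison described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity as the similarity measure (the caption does not name one), uses toy 2-D vectors in place of real hidden states, and all function and variable names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_topk_similarity(query_vec, trigger_vecs, k):
    """Mean of the k largest similarities between one query and all triggers."""
    sims = sorted((cosine(query_vec, t) for t in trigger_vecs), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)

def group_means(query_vecs, rejected_flags, trigger_vecs, k):
    """Average the per-query top-k score separately for rejected and accepted queries."""
    rejected, accepted = [], []
    for vec, was_rejected in zip(query_vecs, rejected_flags):
        score = mean_topk_similarity(vec, trigger_vecs, k)
        (rejected if was_rejected else accepted).append(score)

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(rejected), mean(accepted)

# Toy vectors standing in for hidden states (hypothetical data).
triggers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.0], [-1.0, 0.0]]
flags = [True, False]  # True = the model rejected this benign query

rej_mean, acc_mean = group_means(queries, flags, triggers, k=2)
print(rej_mean, acc_mean)
```

In this toy setup the rejected query lies closer to the triggers, so its group mean is higher, which is the qualitative pattern the caption reports for real hidden states.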