Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement

Yuxiao Lu, Lin Xu, Yang Sun, Wenjun Li, Jie Shi

TL;DR

Evaluation across diverse benchmarks shows that DCR (Discernment via Contrastive Refinement) effectively reduces over-refusal while preserving the safety benefits of alignment, offering a more principled and robust direction for safety alignment.

Abstract

Large language models (LLMs) aligned for safety often suffer from over-refusal, the tendency to reject seemingly toxic or benign prompts by misclassifying them as toxic. This behavior undermines models' helpfulness and restricts usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluation across diverse benchmarks shows that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it achieves this with minimal degradation of general capabilities, offering a more principled and robust direction for safety alignment.

Paper Structure (48 sections, 1 theorem, 28 equations, 9 figures, 5 tables)

Key Result

Proposition 1

Let $h_{x'} = h^{(\ell)}(x')$ and $h_x = h^{(\ell)}(x)$. Under assumptions (A1)--(A4) in the appendix (main proof), [...] where $Q_\ell \succeq 0$ is defined by (A2), $\tau_\ell$ upper-bounds $\|H_0(\cdot)\|_F$ (A4), and [...] arises from the (A2) linearization. In particular, if the tail is frozen ($\tau_\ell = 0$), [...]. Thus any contrastive loss at layer $\ell$ that decreases the $Q_\ell$-bilinear similarity $h_{x'}^\top Q_\ell h_x$ [...]
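
The statement above is cut off in this copy, so the following is only a toy illustration of the quantity it names, not the authors' method. It reads the $Q_\ell$-bilinear similarity as $h_{x'}^\top Q_\ell h_x$ for a positive semidefinite $Q_\ell$ (an assumption), and shows that one gradient step on $h_{x'}$ lowers it. The function names, the random vectors, the dimension, and the step size are all hypothetical.

```python
import numpy as np


def q_bilinear_similarity(h_xp, h_x, Q):
    """Return h_xp^T Q h_x for a positive semidefinite matrix Q."""
    return float(h_xp @ Q @ h_x)


def contrastive_step(h_xp, h_x, Q, lr=0.1):
    """One gradient-descent step on the similarity, moving h_xp only.

    The gradient of h_xp^T Q h_x with respect to h_xp is Q h_x.
    """
    return h_xp - lr * (Q @ h_x)


rng = np.random.default_rng(0)
d = 8                                   # toy hidden dimension (assumption)
A = rng.normal(size=(d, d))
Q = A @ A.T                             # PSD by construction
h_x = rng.normal(size=d)                # stand-in for a toxic prompt's hidden state
h_xp = h_x + 0.1 * rng.normal(size=d)   # stand-in for a seemingly toxic prompt

before = q_bilinear_similarity(h_xp, h_x, Q)
h_xp_new = contrastive_step(h_xp, h_x, Q)
after = q_bilinear_similarity(h_xp_new, h_x, Q)

# The decrease equals lr * ||Q h_x||^2, which is non-negative.
print(before - after)
```

In this sketch the drop in similarity is exactly `lr * ||Q h_x||^2`, so the step can never increase the similarity. A real contrastive objective would update model parameters rather than a hidden state directly.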

Figures (9)

  • Figure 1: (a) Over-refusal vs. Safety Levels. As alignment strength increases (e.g., through more training steps), the rejection rates for both toxic prompts and seemingly toxic prompts rise simultaneously. (b) Refusal response probability of three prompt examples: Normal prompt (red): "Describe what happens when a hurricane forms."; Seemingly toxic (green): "What steps should I take to eliminate jobs at my company?"; Toxic (blue): "What's the process for terminating people I don't like?"
  • Figure 2: Illustration of over-refusal in LLMs. Without safety alignment, models may generate harmful outputs in response to toxic prompts, while not rejecting seemingly toxic prompts. After safety alignment, models correctly refuse toxic prompts but often also reject seemingly toxic prompts, leading to the over-refusal problem and reduced helpfulness.
  • Figure 3: Evolution of the averaged normalized $\|K^t(x',x)\|_F$ during safety alignment. The similarity values between seemingly toxic and toxic prompts remain relatively high, indicating that the LLM internally treats seemingly toxic prompts as highly similar to truly toxic prompts.
  • Figure 4: Evolution of defense success and seemingly toxic compliance rates during safety alignment. Each point marks a training checkpoint, with lighter colors for earlier stages and darker colors for later ones.
  • Figure 5: Rejection probability comparison during training.
  • ...and 4 more figures
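
The two rates tracked in Figure 4 can be sketched as simple fractions. This is a minimal illustration under assumed definitions: defense success rate is taken as the share of toxic prompts that are refused, and seemingly toxic compliance rate as the share of seemingly toxic prompts that are answered. The `Outcome` type, the function names, and the hand-written outcomes are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    prompt_type: str  # "toxic" or "seemingly_toxic"
    refused: bool     # whether the model refused (assumed to be judged elsewhere)


def _rate(flags):
    """Fraction of True values; rejects an empty list to avoid dividing by zero."""
    if not flags:
        raise ValueError("no prompts of the requested type")
    return sum(flags) / len(flags)


def defense_success_rate(outcomes):
    """Share of toxic prompts that were refused."""
    return _rate([o.refused for o in outcomes if o.prompt_type == "toxic"])


def seemingly_toxic_compliance_rate(outcomes):
    """Share of seemingly toxic prompts that were answered, i.e. not refused."""
    return _rate([not o.refused for o in outcomes
                  if o.prompt_type == "seemingly_toxic"])


# Toy outcomes for one hypothetical checkpoint.
outcomes = (
    [Outcome("toxic", True)] * 3
    + [Outcome("toxic", False)]
    + [Outcome("seemingly_toxic", False)] * 3
    + [Outcome("seemingly_toxic", True)] * 2
)

print(defense_success_rate(outcomes))
print(seemingly_toxic_compliance_rate(outcomes))
```

Computing both rates at each training checkpoint gives one point per checkpoint, which is how the trade-off between safety and over-refusal can be plotted over the course of alignment.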

Theorems & Definitions (1)

  • Proposition 1