Navigating the OverKill in Large Language Models

Chenyu Shi, Xiao Wang, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Xun Zhao, Dahua Lin

TL;DR

Large language models exhibit OverKill, where safety alignment causes even benign queries to be refused. The authors diagnose this via information-flow analysis that reveals token-level shortcuts toward harmful content and show that safety-focused prompts exacerbate these shortcuts. They propose Self-Contrastive Decoding (Self-CD), a training-free, model-agnostic decoding strategy that contrasts outputs from safety-emphasized and standard prompts to identify and downweight refusal-prone distributions, using $\Delta y_t = y_t - y'_t$ and sampling with $y_t \sim \text{softmax}(y_t - \alpha \Delta y_t)$. On OKTest and XSTest-Safe, Self-CD yields an average refusal reduction of about 20% across eight models, with minimal impact on safety, illustrating a practical route to reduce unnecessary refusals in aligned LLMs. The approach is simple, broadly applicable, and emphasizes decoding-time adjustments over retraining, offering scalable benefits for real-world deployment.
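A minimal sketch of one Self-CD decoding step, following the two formulas in the TL;DR. It is not the authors' implementation. It assumes that $y_t$ holds the next-token logits obtained with a safety-emphasizing system prompt and $y'_t$ the logits obtained without it; the TL;DR does not state which is which. The function names, the three-token vocabulary, and the logit values are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_cd_distribution(y, y_prime, alpha):
    """Return softmax(y - alpha * (y - y_prime)) for a single decoding step.

    y       : logits for the next token (assumed: safety-emphasized prompt)
    y_prime : logits for the next token (assumed: standard prompt)
    alpha   : strength of the contrastive correction
    """
    if len(y) != len(y_prime):
        raise ValueError("both logit vectors must cover the same vocabulary")
    delta = [a - b for a, b in zip(y, y_prime)]        # delta y_t
    adjusted = [a - alpha * d for a, d in zip(y, delta)]
    return softmax(adjusted)


# Toy vocabulary; index 0 stands in for a refusal-style opening token.
vocab = ["Sorry", "Sure", "The"]
y = [2.0, 1.0, 0.5]        # hypothetical logits with the safety emphasis
y_prime = [1.0, 1.0, 0.5]  # hypothetical logits without it

baseline = softmax(y)
adjusted = self_cd_distribution(y, y_prime, alpha=1.5)

for token, p_before, p_after in zip(vocab, baseline, adjusted):
    print(f"{token:>5}: {p_before:.3f} -> {p_after:.3f}")
```

In this toy case the probability of the refusal-style token drops after the correction, because only that token's logit differs between the two prompts. In a real decoder the next token would then be sampled from (or taken as the argmax of) the adjusted distribution, and the two logit vectors would come from two forward passes of the same model.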

Abstract

Large language models are meticulously aligned to be both helpful and harmless. However, recent research points to potential overkill, in which models refuse to answer benign queries. In this paper, we investigate the factors behind overkill by exploring how models handle and determine the safety of queries. Our findings reveal the presence of shortcuts within models, which lead to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill. Based on these insights, we introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon. We first extract such over-attention by amplifying the difference in the model's output distributions when responding to system prompts that either include or omit an emphasis on safety. Then we determine the final next-token predictions by downplaying the over-attention from the model via contrastive decoding. Empirical results indicate that our method achieves an average reduction in the refusal rate of 20% while having almost no impact on safety.
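The contrastive update can be expanded algebraically to show how the weight $\alpha$ acts. The step below uses only the symbols from the TL;DR formula ($y_t$, $y'_t$, $\alpha$); which prompt variant each logit vector corresponds to is not fixed by this identity.

```latex
\begin{aligned}
y_t - \alpha\,\Delta y_t
  &= y_t - \alpha\,(y_t - y'_t) \\
  &= (1 - \alpha)\,y_t + \alpha\,y'_t
\end{aligned}
```

Read this way, $\alpha = 0$ leaves $y_t$ unchanged, $\alpha = 1$ yields exactly $y'_t$, and $\alpha > 1$ extrapolates beyond $y'_t$ in the direction away from $y_t$, which is one way to interpret "amplifying the difference" between the two output distributions.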

Paper Structure

This paper contains 32 sections, 8 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The illustration demonstrates the phenomenon of overkill. For the dangerous question on the left, the model is able to make a true refusal. However, for the safe question on the right, which contains the same harmful word 'kill', the model makes a false refusal.
  • Figure 2: Comparison of information flow. The experiments were conducted on the XSTest dataset. (a) and (b) show the information flow under different semantic conditions with the same focus word. (c) and (d) show the information flow before and after the inclusion of safety-oriented system prompts.
  • Figure 3: The framework of Self-CD. We first extract the over-attention by amplifying the difference in the model's output distributions when responding to system prompts that either include or omit an emphasis on safety. Then we determine the final next-token predictions by downplaying the over-attention from the model via contrastive decoding.
  • Figure 4: The win rate of Raw and Self-CD on I-CoNa.
  • Figure 6: Word cloud visualization for the first word of response.
  • ...and 2 more figures