Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior
Shengyun Si, Xinpeng Wang, Guangyao Zhai, Nassir Navab, Barbara Plank
TL;DR
The paper tackles the problem of false refusal in safety-aligned LLMs by introducing Think Before Refusal (TBR), which prompts models to reflect on input instructions before responding. It proposes a two-step methodology: generating safety reflections (internally or via a more capable external model) and safety-reflection instruction fine-tuning, incorporating these rationales into a mixed dataset to train refuse-with-reflection behavior while preserving safety and general performance. Across 15 models of varying sizes, the approach demonstrates that safety-reflection fine-tuning reduces false refusals, with external reflections yielding stronger improvements, and that larger models benefit more from the method. The work provides practical insights into leveraging reasoning capabilities for safety alignment and suggests directions for integrating reflection-based strategies with broader alignment techniques in future model development.
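As a rough illustration of the data-construction step described above, the sketch below prepends a safety reflection to each training target and mixes general and safety examples into one dataset. The `Example` class, the `Reflection:` / `Response:` template, and the function names are assumptions made for this sketch, not the format used in the paper, and the reflection texts are invented placeholders.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    instruction: str
    response: str
    reflection: str = ""  # hypothetical field: rationale on whether the request is safe


def format_target(ex: Example) -> str:
    # Put the reflection before the answer so the model is trained to reason first.
    # The "Reflection:/Response:" template is an assumption of this sketch.
    if ex.reflection:
        return f"Reflection: {ex.reflection}\nResponse: {ex.response}"
    return ex.response


def build_mixed_dataset(general, safety, seed=0):
    # Combine general-purpose and safety examples, then shuffle deterministically.
    records = [
        {"prompt": ex.instruction, "target": format_target(ex)}
        for ex in list(general) + list(safety)
    ]
    random.Random(seed).shuffle(records)
    return records


# Placeholder examples; the reflections would come from the model itself
# or from a more capable external model.
general = [Example(
    "Tell me how to kill a Python process",
    "Find the process ID and send it a termination signal.",
    "Here 'kill' means stopping a program, so the request is benign.",
)]
safety = [Example(
    "Explain how to burn down my neighbor's house",
    "I can't help with that.",
    "The request seeks help with arson, which would harm people and property.",
)]
dataset = build_mixed_dataset(general, safety)
```

Each record in `dataset` could then be passed to a standard instruction fine-tuning pipeline; the key design choice is that the rationale precedes the final answer, so a refusal is only produced after the request has been assessed.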
Abstract
Recent advancements in large language models (LLMs) have demonstrated that fine-tuning and human alignment can render LLMs harmless. In practice, such "harmlessness" behavior is mainly achieved by training models to reject harmful requests, such as "Explain how to burn down my neighbor's house", where the model appropriately declines to respond. However, this approach can inadvertently result in false refusal, where models reject benign queries as well, such as "Tell me how to kill a Python process". In this work, we demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. Building on this finding, we introduce the Think-Before-Refusal (TBR) schema and conduct safety-aware instruction fine-tuning incorporating safety reflection. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior while maintaining safety and overall performance compared to those fine-tuned without safety reflection.
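To make the notion of false refusal concrete, the following minimal sketch computes the fraction of responses to benign prompts that look like refusals. The keyword list and the function names are hypothetical, and substring matching is only a crude stand-in for a proper refusal classifier; it is not the evaluation procedure used in this work.

```python
# Hypothetical refusal markers (ASCII apostrophes only); a real evaluation
# would use a more robust classifier or human judgment.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    # Flag a response as a refusal if it contains any marker phrase.
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def false_refusal_rate(benign_responses) -> float:
    # Share of responses to benign prompts that were flagged as refusals.
    responses = list(benign_responses)
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


# Placeholder responses to the benign prompt "Tell me how to kill a Python process".
responses = [
    "I'm sorry, but I can't help with that.",
    "Look up the process ID and send it a termination signal.",
]
rate = false_refusal_rate(responses)
```

A lower rate on benign prompts, measured alongside an unchanged refusal rate on genuinely harmful prompts, is the kind of outcome the abstract describes.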
