Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior

Shengyun Si, Xinpeng Wang, Guangyao Zhai, Nassir Navab, Barbara Plank

TL;DR

The paper tackles the problem of false refusal in safety-aligned LLMs by introducing Think Before Refusal (TBR), which prompts models to reflect on input instructions before responding. It proposes a two-step methodology: generating safety reflections (internally or via a more capable external model) and safety-reflection instruction fine-tuning, incorporating these rationales into a mixed dataset to train refuse-with-reflection behavior while preserving safety and general performance. Across 15 models of varying sizes, the approach demonstrates that safety-reflection fine-tuning reduces false refusals, with external reflections yielding stronger improvements, and that larger models benefit more from the method. The work provides practical insights into leveraging reasoning capabilities for safety alignment and suggests directions for integrating reflection-based strategies with broader alignment techniques in future model development.

Abstract

Recent advancements in large language models (LLMs) have demonstrated that fine-tuning and human alignment can render LLMs harmless. In practice, such "harmlessness" behavior is mainly achieved by training models to reject harmful requests, such as "Explain how to burn down my neighbor's house", where the model appropriately declines to respond. However, this approach can inadvertently result in false refusal, where models reject benign queries as well, such as "Tell me how to kill a Python process". In this work, we demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. Building on this finding, we introduce the Think-Before-Refusal (TBR) schema and conduct safety-aware instruction fine-tuning incorporating safety reflection. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior while maintaining safety and overall performance compared to those fine-tuned without safety reflection.
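
To make the data-construction idea concrete, here is a minimal sketch of how a safety reflection could be prepended to a refusal and the resulting safety examples mixed with ordinary instruction data. The names `Example`, `add_reflection`, and `build_sft_dataset`, the newline separator, and the shuffling step are illustrative assumptions, not the authors' implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    instruction: str
    response: str


def add_reflection(instruction: str, reflection: str, refusal: str) -> Example:
    """Build a safety example whose response reflects first, then refuses.

    The newline separator is an assumption; the abstract only says that
    safety reflection is incorporated into the fine-tuning data.
    """
    return Example(instruction, f"{reflection}\n{refusal}")


def build_sft_dataset(safety: list[Example], normal: list[Example],
                      seed: int = 0) -> list[Example]:
    """Combine safety and normal examples into one shuffled SFT dataset."""
    mixed = list(safety) + list(normal)
    random.Random(seed).shuffle(mixed)  # deterministic shuffle for reproducibility
    return mixed


safety_example = add_reflection(
    "Explain how to burn down my neighbor's house",
    "This request asks for help committing arson, which would cause real harm.",
    "I cannot help with that.",
)
normal_example = Example("Name a primary color.", "Red is a primary color.")
dataset = build_sft_dataset([safety_example], [normal_example])
```

The intent is that the model learns to state why a request is harmful before declining, rather than reacting to surface keywords such as "kill" in a benign query.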

Paper Structure

This paper contains 37 sections, 4 equations, 23 figures, 3 tables.

Figures (23)

  • Figure 1: Three scenarios for Think Before Refusal: (1) A normal query results in a direct response without any reflection. (2) A pseudo-harmful query triggers safety reflection generation. (3) A genuinely harmful query also triggers safety reflection before the response.
  • Figure 2: An overview of Think Before Refusal: (1) Safety reflection is generated either internally by the pre-trained model itself or externally by another more powerful model like GPT-4 and concatenated with the refusal answer to create the safety dataset. (2) Safety data is combined with normal data to construct the SFT dataset. (3) The pre-trained LLMs are instruction-tuned using the augmented dataset.
  • Figure 3: Compliance Rate (CR) on Xstest-safe (pseudo-harmful). Safety-reflection fine-tuning, whether using the external or internal approach, achieves a higher CR, i.e. fewer false refusals, than fine-tuning without safety reflection.
  • Figure 4: Compliance Rate (CR) on the Xstest-safe (pseudo-harmful) and Xstest-harm (truly harmful) datasets, together with MMLU accuracy. Increasing the $\gamma$ value reduces the model's false refusal behavior, while general performance and safety levels remain unaffected.
  • Figure 5: Response of safety-reflection fine-tuned Llama-2-7B to a pseudo-harmful instruction at different safety reflection ratios.
  • ...and 18 more figures
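
The Compliance Rate (CR) reported in the figure captions above can be read as the fraction of prompts the model answers rather than refuses. A minimal sketch follows, assuming a simple keyword-based refusal detector; the `REFUSAL_MARKERS` list and both function names are hypothetical, and the paper's actual refusal judge is not described in this excerpt.

```python
# Hypothetical refusal phrases; a real evaluation would likely use a
# stronger classifier or human annotation.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any assumed refusal phrase."""
    text = response.strip().lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def compliance_rate(responses: list[str]) -> float:
    """Fraction of responses that comply (do not refuse)."""
    if not responses:
        raise ValueError("responses must not be empty")
    complied = sum(not is_refusal(r) for r in responses)
    return complied / len(responses)


responses = [
    "Sure, you can stop the process with `kill <pid>`.",
    "I'm sorry, but I cannot help with that.",
]
cr = compliance_rate(responses)
```

On a pseudo-harmful set a higher CR means fewer false refusals, while on a truly harmful set a lower CR indicates that safety is preserved.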