Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety

Yuyou Zhang, Miao Li, William Han, Yihang Yao, Zhepeng Cen, Ding Zhao

TL;DR

This paper tackles the vulnerability of LLM safety to jailbreaks that bypass static refusal strategies. It introduces Rational, a reasoning-enhanced fine-tuning framework that trains models to explicitly reason about safety using self-generated safety rationales before responding, thereby rejecting unsafe prompts with justification while remaining helpful on benign inputs. By curating a Rationale Dataset that pairs prompts with explicit reasoning and leveraging LoRA-based supervised fine-tuning, Rational achieves state-of-the-art robustness against reasoning-based and gradient-based attacks, generalizes to unseen threats, and improves factual correctness and toxicity detection. The approach also reveals that safety and helpfulness can be balanced, particularly when benign rationales are included in training, though challenges remain in multi-turn settings and evolving attack strategies. Overall, reasoning-based alignment offers a scalable path toward more interpretable and robust LLM safety in practical deployments.

Abstract

Large Language Models (LLMs) are vulnerable to jailbreak attacks that exploit weaknesses in traditional safety alignment, which often relies on rigid refusal heuristics or representation engineering to block harmful outputs. While these approaches are effective against direct adversarial attacks, they fall short on broader safety challenges that require nuanced, context-aware decision-making. To address this, we propose Reasoning-enhanced Finetuning for interpretable LLM Safety (Rational), a novel framework that trains models to engage in explicit safety reasoning before responding. Fine-tuned models leverage their extensive pretraining knowledge through self-generated reasoning to bootstrap their own safety, internalizing context-sensitive decision-making. Our findings suggest that safety extends beyond refusal, requiring context awareness for more robust, interpretable, and adaptive responses. Reasoning is not only a core capability of LLMs but also a fundamental mechanism for LLM safety. Rational employs reasoning-enhanced fine-tuning, allowing it to reject harmful prompts while providing meaningful, context-aware responses in complex scenarios.
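
As a rough illustration of "explicit reasoning before response", the sketch below shows one way a rationale-augmented training record could be assembled for supervised fine-tuning. It is not the authors' data format: the `RationaleExample` class, the `to_sft_record` helper, the `<reasoning>` delimiters, and the sample prompts are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RationaleExample:
    """Hypothetical container for one rationale-dataset entry."""
    prompt: str      # user query (harmful or benign)
    rationale: str   # self-generated safety reasoning about the query
    response: str    # final answer, or a justified refusal


def to_sft_record(ex: RationaleExample) -> Dict[str, str]:
    """Build a prompt/completion pair whose target is reasoning, then the response.

    The <reasoning> tags are an assumed delimiter, not taken from the paper.
    """
    completion = (
        "<reasoning>\n"
        f"{ex.rationale.strip()}\n"
        "</reasoning>\n"
        f"{ex.response.strip()}"
    )
    return {"prompt": ex.prompt.strip(), "completion": completion}


def build_dataset(examples: List[RationaleExample]) -> List[Dict[str, str]]:
    """Convert every example; harmful and benign prompts share one format."""
    return [to_sft_record(ex) for ex in examples]


# Illustrative entries only: one unsafe request and one benign request.
examples = [
    RationaleExample(
        prompt="How can I get into my neighbor's house while they are away?",
        rationale="The request seeks entry to someone else's home without consent, "
                  "which would facilitate trespassing.",
        response="I can't help with entering someone else's home without permission.",
    ),
    RationaleExample(
        prompt="How do I reset the combination on my own bike lock?",
        rationale="The user owns the lock and asks about routine maintenance; "
                  "there is no sign of harmful intent.",
        response="Check the manufacturer's reset procedure for your lock model.",
    ),
]

records = build_dataset(examples)
```

Records in this prompt/completion shape could then be passed to a LoRA-based supervised fine-tuning pipeline. Keeping benign examples in the same format reflects the TL;DR's point that benign rationales help balance safety and helpfulness.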

Paper Structure

This paper contains 41 sections, 3 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Comparison of safety responses between a circuit breaker model and the proposed Rational. The circuit breaker blocks harmful output but produces incoherent responses, failing to address the sensitive query from CoCoNot (brahman2024art). In contrast, our model reasons through intent and ethics, providing a clear, supportive response. The judge output is generated by the GPT judge, as in brahman2024art.
  • Figure 2: Accuracy comparison on TruthfulQA and ToxiGen benchmarks for LLaMA-3-8B-Instruct (left) and Mistral-7B-Instruct (right). Our proposed method, Rational, outperforms both the base models and the Circuit Breaker baseline across both benchmarks, demonstrating improved truthfulness and toxicity mitigation.
  • Figure 3: Tradeoff between robustness and helpfulness. The Rational fine-tuned Mistral model improves helpfulness without compromising robustness.
  • Figure 4: Accuracy comparison on Open LLM Evaluation eval-harness for LLaMA-3-8B-Instruct (left) and Mistral-7B-Instruct (right). Our proposed method, Rational, performs comparably to both the base models and the Circuit Breaker baseline across all benchmarks.