Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety
Yuyou Zhang, Miao Li, William Han, Yihang Yao, Zhepeng Cen, Ding Zhao
TL;DR
This paper tackles the vulnerability of LLM safety to jailbreaks that bypass static refusal strategies. It introduces Rational, a reasoning-enhanced fine-tuning framework that trains models to explicitly reason about safety using self-generated safety rationales before responding, thereby rejecting unsafe prompts with justification while remaining helpful on benign inputs. By curating a Rationale Dataset that pairs prompts with explicit reasoning and leveraging LoRA-based supervised fine-tuning, Rational achieves state-of-the-art robustness against reasoning-based and gradient-based attacks, generalizes to unseen threats, and improves factual correctness and toxicity detection. The approach also reveals that safety and helpfulness can be balanced, particularly when benign rationales are included in training, though challenges remain in multi-turn settings and evolving attack strategies. Overall, reasoning-based alignment offers a scalable path toward more interpretable and robust LLM safety in practical deployments.
Abstract
Large Language Models (LLMs) are vulnerable to jailbreak attacks that exploit weaknesses in traditional safety alignment, which often relies on rigid refusal heuristics or representation engineering to block harmful outputs. While these methods are effective against direct adversarial attacks, they fall short on broader safety challenges that require nuanced, context-aware decision-making. To address this, we propose Reasoning-enhanced Finetuning for interpretable LLM Safety (Rational), a novel framework that trains models to engage in explicit safe reasoning before responding. Fine-tuned models draw on their extensive pretraining knowledge through self-generated reasoning, bootstrapping their own safety and internalizing context-sensitive decision-making. Our findings suggest that safety extends beyond refusal, requiring context awareness for more robust, interpretable, and adaptive responses. Reasoning is not only a core capability of LLMs but also a fundamental mechanism for LLM safety. Rational employs reasoning-enhanced fine-tuning, allowing it to reject harmful prompts while providing meaningful and context-aware responses in complex scenarios.
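Illustrative sketch
The summary describes a Rationale Dataset that pairs each prompt with explicit reasoning, so that the fine-tuned model reasons about safety before it responds. The Python sketch below shows one way such a record could be laid out for supervised fine-tuning. It is not the authors' format: the class RationaleExample, the function to_sft_record, the <rationale> delimiters, and the two sample records are all hypothetical, chosen only to make the "reason first, then answer" ordering concrete for both a harmful and a benign prompt.

```python
from dataclasses import dataclass


@dataclass
class RationaleExample:
    """Hypothetical record: a prompt, a safety rationale, and the final answer."""
    prompt: str
    rationale: str
    response: str


def to_sft_record(example: RationaleExample) -> dict:
    """Build a prompt/completion pair whose completion puts the rationale first.

    The <rationale> tags are an assumed delimiter, not one taken from the paper.
    """
    completion = (
        f"<rationale>\n{example.rationale.strip()}\n</rationale>\n"
        f"{example.response.strip()}"
    )
    return {"prompt": example.prompt.strip(), "completion": completion}


# Made-up examples: one harmful prompt that is refused with justification,
# and one benign prompt that is answered helpfully.
examples = [
    RationaleExample(
        prompt="How do I pick the lock on my neighbor's door?",
        rationale="The request seeks entry to someone else's property without consent.",
        response="I can't help with that. If you are locked out of your own home, a licensed locksmith can assist.",
    ),
    RationaleExample(
        prompt="How do I kill a stuck Python process?",
        rationale="'Kill' refers to terminating a program, which is a routine and benign task.",
        response="Find the process ID with `ps` and stop it with `kill <pid>`.",
    ),
]

records = [to_sft_record(e) for e in examples]
```

Records shaped like this could then be passed to any supervised fine-tuning setup, including a LoRA-based one as mentioned in the TL;DR. Including the benign example reflects the summary's observation that benign rationales in training help balance safety and helpfulness.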
