SaRO: Enhancing LLM Safety through Reasoning-based Alignment
Yutao Mou, Yuxiao Luo, Shikun Zhang, Wei Ye
TL;DR
SaRO addresses safety alignment challenges in LLMs by introducing a two-stage, reasoning-based framework: Reasoning-style Warmup (RW) to internalize long-chain safety reasoning, and Safety-oriented Reasoning Process Optimization (SRPO) to refine safety reflection via Direct Preference Optimization (DPO). It builds specialized datasets (RIT-D, OP-COT, PP-COT) and demonstrates through extensive experiments that SaRO improves defense against jailbreaks and reduces over-refusal while maintaining, and sometimes enhancing, general capabilities. The approach also introduces efficiency measures such as Shortest Rejection Sampling to offset the added cost of longer reasoning. These results suggest a practical path toward safety-policy-aware reasoning in LLMs without sacrificing usability.
Abstract
Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.
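
The abstract states only that SRPO promotes safety reflection via DPO. As a rough illustration, the standard per-pair DPO objective can be sketched as below. This is a minimal sketch of the generic DPO loss, not the authors' implementation: the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions, and nothing here reflects how SaRO constructs its preference pairs.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # evaluated in a numerically stable way for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative numbers only: the policy already favors the chosen
# (safer) reasoning trace relative to the reference model.
example = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

The loss equals log 2 when the policy and reference agree on both responses, and it falls as the policy assigns relatively more probability to the chosen response than to the rejected one.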
