SaRO: Enhancing LLM Safety through Reasoning-based Alignment

Yutao Mou, Yuxiao Luo, Shikun Zhang, Wei Ye

TL;DR

SaRO addresses two safety alignment challenges in LLMs with a two-stage, reasoning-based framework: Reasoning-style Warmup (RW), which uses supervised fine-tuning to internalize long-chain safety reasoning, and Safety-oriented Reasoning Process Optimization (SRPO), which refines safety reflection via Direct Preference Optimization (DPO). It builds specialized datasets (RIT-D, OP-COT, PP-COT) and demonstrates through extensive experiments that SaRO improves defense against jailbreaks and reduces over-refusal while maintaining, and sometimes enhancing, general capabilities. The approach also introduces efficiency measures such as Shortest Rejection Sampling to offset the added cost of longer reasoning. These results suggest a practical path toward applying safety-policy-aware reasoning in LLMs without sacrificing usability.

Abstract

Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.
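The SRPO stage relies on direct preference optimization. As a rough illustration of what that objective computes for a single preference pair, here is a minimal sketch of the standard DPO loss. It is not the authors' training code: the function names, the `beta` default, and the use of summed sequence log-probabilities as inputs are all assumptions made for the example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state it
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy favours each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled gap between the two log-ratios (the implicit reward margin).
    margin = beta * (chosen_logratio - rejected_logratio)
    # The loss shrinks as the policy prefers the chosen response more strongly.
    return -log_sigmoid(margin)
```

In a safety setting, the chosen response would be the one with the preferred (safe) reasoning process and the rejected response the unsafe one. When the policy and reference agree, the margin is zero and the loss equals log 2; it falls below that as the policy shifts probability toward the chosen response.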

Paper Structure

This paper contains 34 sections, 2 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Illustration of alignment limitations: (a) Over-refusal of benign queries (over-alignment), (b) Susceptibility to jailbreak queries (under-generalization), (c) Possible causes: for LLaMA3, benign query embeddings are closer to harmful ones, leading to over-alignment; for Qwen2, jailbreak embeddings align with general instructions, resulting in under-generalization.
  • Figure 2: Data construction pipeline for SaRO.
  • Figure 3: Visualization of semantic embeddings of different instruction types.
  • Figure 4: Upper: Accuracy of judging safe or unsafe outputs on the validation set during training. Lower: Reward margins between safe and unsafe outputs on the validation set during training.
  • Figure 5: Statistics of reflection and self-correction patterns in mathematical reasoning for LLMs trained with different safety alignment methods.
  • ...and 4 more figures
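
The embedding-space observation in Figure 1(c) and Figure 3 can be made concrete with a small proximity check: average the embeddings of each instruction type and ask which reference group a set of queries sits closest to. The sketch below is an illustrative assumption, not the paper's analysis procedure. The helper names and the two-dimensional toy vectors are hypothetical, and real embeddings would come from the model under study.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_group(query_vectors, groups):
    """Return the name of the group whose centroid is most similar to the
    centroid of query_vectors, together with all similarity scores."""
    query_centroid = centroid(query_vectors)
    scores = {
        name: cosine_similarity(query_centroid, centroid(vectors))
        for name, vectors in groups.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy embeddings, chosen only to illustrate the over-alignment case.
reference_groups = {
    "harmful": [[1.0, 0.0], [0.9, 0.1]],
    "general": [[0.0, 1.0], [0.1, 0.9]],
}
benign_queries = [[0.8, 0.2], [0.7, 0.3]]

label, scores = nearest_group(benign_queries, reference_groups)
```

With these toy vectors the benign queries land nearer the harmful centroid, the pattern the caption associates with over-alignment. The mirror case, jailbreak queries landing nearer the general-instruction centroid, corresponds to under-generalization.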