Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines
Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang
TL;DR
The paper addresses safety gaps in reasoning models facing covert adversarial jailbreak prompts by introducing SGASA (Synthesized Guideline-based Adaptive Safety Alignment), a two-stage framework. In the Data Pre-synthesis stage, the model generates its own safety guidelines and augmented prompts; in the Alignment Fine-tuning stage, it internalizes those guidelines through SFT and DPO with LoRA fine-tuning. The result is an adaptive, model-driven defense that improves safety while minimizing unnecessary refusals of benign requests. Across two backbones and three datasets, SGASA substantially improves safety metrics, with DPO-based variants delivering the strongest gains and generalizing across datasets. The work points toward scalable, automated safety alignment that can adapt to evolving jailbreak strategies, and toward more robust guideline-driven defense mechanisms.
Abstract
Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Because such prompts are covert and deceptive, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts, and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.
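For readers unfamiliar with the DPO objective named in the Alignment Fine-tuning stage, the following is a minimal sketch of the standard per-pair DPO loss. It is not the authors' implementation: the abstract does not give the training objective's details or hyperparameters, so the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are all illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed value, not one reported in the abstract.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable form for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)
```

In a safety-alignment setting one would presumably treat a guideline-following response as the chosen side and a harmful completion (or an unnecessary refusal of a benign prompt) as the rejected side, but that pairing is an assumption here, not something the abstract states. The loss falls below log 2 as the policy favors the chosen response more strongly than the reference does.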
