RAJ-PGA: Reasoning-Activated Jailbreak and Principle-Guided Alignment Framework for Large Reasoning Models
Jianhao Chen, Mayi Xu, Haoyang Chen, Xiaohu Li, Xiangyu Zhang, Jianjie Huang, Zheng Wang, Xiaochun Cao, Tieyun Qian
TL;DR
This work tackles the safety vulnerabilities arising from the reasoning processes of Large Reasoning Models (LRMs). It introduces a two-stage paradigm: Reasoning-Activated Jailbreak via Concretization (RAJ), which reveals latent vulnerabilities, and Principle-Guided Alignment (PGA), which rewrites harmful reasoning traces into safe, instructional content. The authors release the RAJ and PGA datasets (3,989 PGA-aligned samples) and demonstrate that fine-tuning with PGA improves defense success rates by up to 29.5% across jailbreak benchmarks while preserving or enhancing general reasoning. The approach offers a scalable, data-driven path to aligning reasoning-intensive AI with human values: it addresses safety without sacrificing deductive capabilities and reduces the alignment tax across diverse model architectures.
Abstract
Large Reasoning Models (LRMs) face a distinct safety vulnerability: their internal reasoning chains may generate harmful content even when the final output appears benign. To address this overlooked risk, we first propose a novel attack paradigm, Reasoning-Activated Jailbreak (RAJ) via Concretization, which demonstrates that refining malicious prompts to be more specific can trigger step-by-step logical reasoning that overrides the model's safety protocols. To systematically mitigate this vulnerability, we further develop a scalable framework for constructing high-quality safety alignment datasets. The framework first leverages the RAJ attack to elicit challenging harmful reasoning chains from LRMs, and then transforms these high-risk traces into safe, constructive, and educational responses through a tailored Principle-Guided Alignment (PGA) mechanism. Using this method, we construct the PGA dataset, a verified alignment dataset containing 3,989 samples. Extensive experiments show that fine-tuning LRMs on the PGA dataset significantly enhances model safety, achieving up to a 29.5% improvement in defense success rates across multiple jailbreak benchmarks. Critically, our approach not only defends against sophisticated reasoning-based attacks but also preserves, and even enhances, the model's general reasoning capabilities. This work provides a scalable and effective pathway for safety alignment in reasoning-intensive AI systems, addressing the core trade-off between safety and functional performance.
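As a reading aid for the headline metric, the following minimal sketch shows one way a defense success rate and its improvement could be computed. It assumes the rate is the fraction of attack prompts for which the model's response is judged safe; the abstract does not give the exact definition, nor does it say whether the 29.5% figure is an absolute or a relative gain. The function names and the sample outcomes are hypothetical and are not results from the paper.

```python
from typing import Sequence


def defense_success_rate(defended: Sequence[bool]) -> float:
    """Fraction of attack prompts the model defended against.

    Assumed definition: one boolean per attack prompt, True when the
    response was judged safe. The paper's own judging procedure may differ.
    """
    if not defended:
        raise ValueError("need at least one evaluated prompt")
    return sum(defended) / len(defended)


def improvement_in_points(before: float, after: float) -> float:
    """Absolute change between two rates, in percentage points."""
    return (after - before) * 100


def relative_improvement(before: float, after: float) -> float:
    """Change relative to the baseline rate, in percent."""
    if before == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (after - before) / before * 100


# Hypothetical per-prompt outcomes for a baseline and a fine-tuned model.
baseline = [True, False, False, True, False, True, False, False, True, False]
tuned = [True, True, False, True, True, True, False, True, True, False]

dsr_before = defense_success_rate(baseline)
dsr_after = defense_success_rate(tuned)

print(f"baseline DSR: {dsr_before:.2f}, fine-tuned DSR: {dsr_after:.2f}")
print(f"absolute gain: {improvement_in_points(dsr_before, dsr_after):.1f} points")
print(f"relative gain: {relative_improvement(dsr_before, dsr_after):.1f}%")
```

The two gain functions are kept separate because the same pair of rates yields very different numbers depending on which convention is reported.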
