RAJ-PGA: Reasoning-Activated Jailbreak and Principle-Guided Alignment Framework for Large Reasoning Models

Jianhao Chen, Mayi Xu, Haoyang Chen, Xiaohu Li, Xiangyu Zhang, Jianjie Huang, Zheng Wang, Xiaochun Cao, Tieyun Qian

TL;DR

This work tackles safety vulnerabilities that arise from the reasoning processes of Large Reasoning Models (LRMs). It introduces a two-stage paradigm: Reasoning-Activated Jailbreak via Concretization (RAJ), which reveals latent vulnerabilities, and Principle-Guided Alignment (PGA), which rewrites harmful reasoning traces into safe, instructional content. The authors release the RAJ and PGA datasets (3,989 PGA-aligned samples) and demonstrate that fine-tuning with PGA improves defense success rates by up to 29.5% across jailbreak benchmarks while preserving or enhancing general reasoning. The approach offers a scalable, data-driven path to aligning reasoning-intensive AI with human values, improving safety without sacrificing deductive capabilities and reducing the alignment tax across diverse model architectures.

Abstract

Large Reasoning Models (LRMs) face a distinct safety vulnerability: their internal reasoning chains may generate harmful content even when the final output appears benign. To address this overlooked risk, we first propose a novel attack paradigm, Reasoning-Activated Jailbreak (RAJ) via Concretization, which demonstrates that refining malicious prompts to be more specific can trigger step-by-step logical reasoning that overrides the model's safety protocols. To systematically mitigate this vulnerability, we further develop a scalable framework for constructing high-quality safety alignment datasets. This framework first leverages the RAJ attack to elicit challenging harmful reasoning chains from LRMs, then transforms these high-risk traces into safe, constructive, and educational responses through a tailored Principle-Guided Alignment (PGA) mechanism. We then introduce the PGA dataset, a verified alignment dataset of 3,989 samples constructed with our proposed method. Extensive experiments show that fine-tuning LRMs with the PGA dataset significantly enhances model safety, achieving up to a 29.5% improvement in defense success rates across multiple jailbreak benchmarks. Critically, our approach not only defends against sophisticated reasoning-based attacks but also preserves, and even enhances, the model's general reasoning capabilities. This work provides a scalable and effective pathway for safety alignment in reasoning-intensive AI systems, addressing the core trade-off between safety and functional performance.

Paper Structure

This paper contains 31 sections, 8 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Conceptual Framework of the Safety-Reasoning Dilemma and the Concretization-Based Jailbreak Attack. (a) The inherent competition between reasoning goals and safety constraints. (b) Performance shift in vertical/lateral thinking and ASR before and after concretization. (c) Illustration of how detailed prompts bypass safety filters by activating vertical thinking trajectories.
  • Figure 2: Architectural Overview of the Reasoning-Activated Jailbreak and Principle-Guided Alignment Framework. The pipeline comprises three stages: (1) Jailbreak Generation, which uses Concretization to transform original harmful prompts into reasoning-activated harmful prompts that lead the victim LRM to generate controversial content; (2) Alignment Transformation, which employs consensus-based safety validation and Principle-Guided Alignment to rewrite harmful content into principle-aligned content; (3) Model Alignment, where the LRMs' safety performance is enhanced by the Principle-Guided Alignment strategy.
  • Figure 3: Comparison Between $\text{ASR}_{\text{Original}}$ and $\text{ASR}_{\text{Concretized}}$ in Reasoning and Response Phases. $\Delta$ denotes $\text{ASR}_{\text{Concretized}} - \text{ASR}_{\text{Original}}$.
  • Figure 4: Differential word clouds comparing vocabulary changes between original and concretized texts across three components: (a) Prompt, (b) Reasoning, and (c) Response. Red tones indicate words that significantly increased in frequency after concretization, while green tones indicate words that significantly decreased. Word size reflects frequency and change magnitude.
  • Figure 5: Trade-off Analysis between Model Safety (DSR, in %) and General Reasoning Capabilities. The PGA framework (red markers) consistently achieves a superior Pareto frontier compared to baselines across multiple model backbones.
  • ...and 1 more figure
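
The Figure 3 caption defines $\Delta$ as $\text{ASR}_{\text{Concretized}} - \text{ASR}_{\text{Original}}$. The sketch below shows one way that quantity could be computed from per-sample judgments. It rests on assumptions not stated on this page: that ASR is the percentage of attempts a judge marks as successful attacks, and that judgments arrive as booleans. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(flags: Sequence[bool]) -> float:
    """Return the percentage of attempts flagged as successful attacks.

    Assumption: each flag is True when a judge marked the model output
    (reasoning or response) as harmful for that prompt.
    """
    if not flags:
        raise ValueError("need at least one judged sample")
    return 100.0 * sum(flags) / len(flags)


def asr_delta(original: Sequence[bool], concretized: Sequence[bool]) -> float:
    """Delta = ASR_Concretized - ASR_Original, as in the Figure 3 caption."""
    return attack_success_rate(concretized) - attack_success_rate(original)


# Toy, made-up judgments for four prompts (not data from the paper).
original_flags = [False, False, True, False]
concretized_flags = [True, False, True, True]

delta = asr_delta(original_flags, concretized_flags)
print(f"ASR original:    {attack_success_rate(original_flags):.1f}%")
print(f"ASR concretized: {attack_success_rate(concretized_flags):.1f}%")
print(f"Delta:           {delta:+.1f} points")
```

A positive $\Delta$ means the concretized prompts succeeded more often than the originals. The same helper can be applied separately to reasoning-phase and response-phase judgments, which are the two phases the caption compares.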