STAIR: Improving Safety Alignment with Introspective Reasoning

Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu

TL;DR

STAIR embeds safety-aware introspective reasoning into LLMs by combining structured chain-of-thought data, Safety-Informed Monte Carlo Tree Search (SI-MCTS), and a process reward model used at test time. Iterative self-improvement with step-level preference optimization balances safety and helpfulness and mitigates the safety-performance trade-off. Test-time scaling with Best-of-N (BoN) and beam search further improves safety against jailbreak attacks, reaching safety comparable to Claude-3.5 on benchmarks while preserving performance on reasoning and factual tasks. The self-improvement stage relies on self-generated data and self-assigned rewards rather than external evaluators.

Abstract

Ensuring the safety and harmlessness of Large Language Models (LLMs) has become as critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Introspective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks. Relevant resources in this work are available at https://github.com/thu-ml/STAIR.

Paper Structure

This paper contains 28 sections, 3 theorems, 13 equations, 8 figures, 7 tables, 1 algorithm.

Key Result

Theorem 2.1

Fix constants $C_1, C_2\in [-1,1],\;C_1\ne0$. Suppose $R:[-1,1]\times[-1,1]\rightarrow \mathbb{R}$ is twice-differentiable and satisfies $\frac{\partial R}{\partial H}=F(S)$, for some continuous function $F: [-1,1]\rightarrow \mathbb{R}$. The last two properties hold if and only if with $F(0)=0, F(C_1)=1, \forall S>0, F(S)>0, \forall S<0, F(S)<0$ and $c$ as a constant.
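
One step implicit in the hypothesis can be written out. Because $\frac{\partial R}{\partial H}$ depends only on $S$, integrating in $H$ shows that $R$ is affine in $H$ for each fixed $S$. The symbol $G$ below is introduced here for the $H$-independent part and is not notation taken from the paper.

```latex
\frac{\partial R}{\partial H}(H,S) = F(S)
\quad\Longrightarrow\quad
R(H,S) = F(S)\,H + G(S),
\qquad G(S) := R(0,S).
```

Under the stated sign conditions on $F$, the reward therefore increases with helpfulness $H$ when $S>0$, decreases with $H$ when $S<0$, and does not depend on $H$ when $S=0$. As a simple instance, $F(S)=S$ meets the conditions $F(0)=0$, $F(S)>0$ for $S>0$, and $F(S)<0$ for $S<0$, and the normalization $F(C_1)=1$ then forces $C_1=1$.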

Figures (8)

  • Figure 1: Although existing safety alignment methods enable LLMs to directly refuse queries with apparent risks, they often fail to resist jailbreak attacks that manage to avoid the initial refusal tokens. Such instinctive responses correspond to System 1 thinking. In this paper, we propose to improve safety alignment with introspective reasoning, encouraging LLMs to scrutinize the underlying risks with safety-aware System 2 thinking before making refusals.
  • Figure 2: The STAIR framework consists of three stages. First, a model is trained on structured CoT data generated by prompting GPT-4o. The model is then used to build Safety-Informed MCTS (SI-MCTS) search trees through self-generation and self-rewarding; the safety-informed reward function in this process incorporates both safety and helpfulness information into the internal search nodes. From the constructed search trees, a stepwise preference dataset is collected with threshold sampling and used to optimize the model via step-level DPO. This self-improvement process can be repeated for $K=3$ iterations. Finally, a process reward model (PRM) can be trained on the same search trees to guide the model from the last iteration toward better and safer responses through test-time search algorithms.
  • Figure 3: Changes in goodness scores on StrongReject with test-time scaling.
  • Figure 4: Changes in winning rates on AlpacaEval with test-time scaling.
  • Figure 5: Results on StrongReject and AlpacaEval as the ratio of safety data varies.
  • ...and 3 more figures
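
The stepwise preference collection described in the Figure 2 caption can be sketched roughly as follows. The caption does not spell out the sampling criterion, so this sketch assumes one plausible reading: among sibling nodes of a search tree, pair the highest-valued step with any sibling whose value is lower by at least a threshold. The `Node` class, `collect_pairs`, and the `min_gap` parameter are hypothetical names, not taken from the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    step: str                       # text of this reasoning step
    value: float                    # value estimate from the tree search
    children: List["Node"] = field(default_factory=list)


def collect_pairs(root: Node, min_gap: float = 0.5) -> List[Tuple[str, str, str]]:
    """Return (prefix, chosen_step, rejected_step) triples from sibling nodes."""
    pairs = []
    stack = [(root, root.step)]
    while stack:
        node, prefix = stack.pop()
        kids = sorted(node.children, key=lambda c: c.value, reverse=True)
        if kids:
            best = kids[0]
            for other in kids[1:]:
                # Assumed criterion: keep a pair only if the value gap is large enough.
                if best.value - other.value >= min_gap:
                    pairs.append((prefix, best.step, other.step))
        for child in node.children:
            stack.append((child, prefix + "\n" + child.step))
    return pairs


# Toy tree with made-up values, for illustration only.
root = Node("Q: prompt", 0.0, [
    Node("Step A", 0.9, [Node("A1", 0.8), Node("A2", 0.7)]),
    Node("Step B", -0.4),
    Node("Step C", 0.6),
])
pairs = collect_pairs(root, min_gap=0.5)
```

Each triple shares a common reasoning prefix, so the resulting pairs differ only in the next step, which is the shape of data a step-level preference objective expects.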

Theorems & Definitions (4)

  • Theorem 2.1
  • Theorem 2.1
  • proof
  • Corollary 2.2