Trust-Region Adaptive Policy Optimization

Mingyu Su, Jian Guan, Yuxian Gu, Minlie Huang, Hongning Wang

TL;DR

TRAPO introduces a one-stage post-training paradigm that interleaves Supervised Fine-Tuning and Reinforcement Learning at the instance level. It couples Trust-Region SFT, which stabilizes knowledge absorption by shifting from forward KL to reverse KL, with an adaptive prefix-guidance mechanism and micro-group sampling to balance exploration and imitation. The approach yields consistent improvements across five mathematical reasoning benchmarks and general-domain tasks, outperforming SFT, RL, and SFT-then-RL baselines and extending the model's reasoning capabilities. These results establish TRAPO as a robust paradigm for reasoning-enhanced LLMs with stronger test-time scaling. The framework is supported by theoretical insights into KL-divergence behavior and practical ablations demonstrating the contributions of TrSFT and adaptive guidance.

Abstract

Post-training methods, especially Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), play an important role in improving large language models' (LLMs) complex reasoning abilities. However, the dominant two-stage pipeline (SFT then RL) suffers from a key inconsistency: SFT enforces rigid imitation that suppresses exploration and induces forgetting, limiting RL's potential for improvements. We address this inefficiency with TRAPO (Trust-Region Adaptive Policy Optimization), a hybrid framework that interleaves SFT and RL within each training instance by optimizing SFT loss on expert prefixes and RL loss on the model's own completions, unifying external supervision and self-exploration. To stabilize training, we introduce Trust-Region SFT (TrSFT), which minimizes forward KL divergence inside a trust region but attenuates optimization outside, effectively shifting toward reverse KL and yielding stable, mode-seeking updates favorable for RL. An adaptive prefix-selection mechanism further allocates expert guidance based on measured utility. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines, as well as recent state-of-the-art approaches, establishing a strong new paradigm for reasoning-enhanced LLMs.
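The instance-level interleaving described in the abstract (SFT loss on the expert prefix, RL loss on the model's own completion) can be pictured as a per-token split of one trajectory. The following is only a minimal sketch under stated assumptions: the function name `combine_token_losses`, the equal weighting of the two loss types, and the simple averaging over tokens are hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import List


def combine_token_losses(
    sft_losses: List[float],
    rl_losses: List[float],
    prefix_len: int,
) -> float:
    """Average a per-token loss over one trajectory.

    Tokens before `prefix_len` are treated as the expert prefix and
    contribute their SFT-style loss; the remaining tokens are treated
    as the model's own continuation and contribute their RL-style loss.
    Equal weighting of the two parts is an assumption of this sketch.
    """
    if len(sft_losses) != len(rl_losses):
        raise ValueError("both loss lists must cover the same tokens")
    n_tokens = len(sft_losses)
    if not 0 <= prefix_len <= n_tokens:
        raise ValueError("prefix_len must lie within the trajectory")
    if n_tokens == 0:
        return 0.0
    total = sum(sft_losses[:prefix_len]) + sum(rl_losses[prefix_len:])
    return total / n_tokens
```

With `prefix_len = 0` the trajectory is a pure RL rollout, and with `prefix_len` equal to the trajectory length it reduces to pure imitation; intermediate values mix the two within a single training instance.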

Paper Structure

This paper contains 49 sections, 2 theorems, 18 equations, 9 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Let $S(\lambda) = \{\, c \in \mathcal{C} \mid p_{\text{E}}(c) > \alpha \lambda \,\}$, where $\mathcal{C}$ is the vocabulary. There exists a unique $\lambda \in (0, 1)$ such that $\lambda = \sum_{c \in S(\lambda)} p_{\text{E}}(c)$. For this $\lambda$, the optimal solution of the optimization problem
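The self-consistency condition $\lambda = \sum_{c \in S(\lambda)} p_{\text{E}}(c)$ can be checked numerically on a toy distribution. The sketch below is not the paper's proof or algorithm, and it says nothing about the optimization problem that the statement goes on to describe. It relies only on the observation that $S(\lambda)$ always consists of the highest-probability tokens, so each candidate $\lambda$ is a cumulative sum of the sorted probabilities. The function name `fixed_point_lambdas` and the example values for $p_{\text{E}}$ and $\alpha$ are hypothetical.

```python
from typing import List


def fixed_point_lambdas(p_expert: List[float], alpha: float) -> List[float]:
    """Return every lambda with lambda == sum of p_E(c) over S(lambda),
    where S(lambda) = {c : p_E(c) > alpha * lambda}."""
    probs = sorted(p_expert, reverse=True)
    solutions = []
    cumulative = 0.0
    for k, p in enumerate(probs, start=1):
        cumulative += p  # candidate lambda if S holds exactly the top-k tokens
        threshold = alpha * cumulative
        # The k-th token must fall inside S(lambda) ...
        kth_inside = p > threshold
        # ... and the next token, if any, must fall outside it.
        next_outside = k == len(probs) or probs[k] <= threshold
        if kth_inside and next_outside:
            solutions.append(cumulative)
    return solutions


# Illustrative values only (not from the paper).
toy_p_expert = [0.5, 0.3, 0.15, 0.05]
toy_alpha = 0.2
toy_solutions = fixed_point_lambdas(toy_p_expert, toy_alpha)
```

For these toy values the search finds a single consistent $\lambda$, approximately 0.8, with $S(\lambda)$ containing the two most probable tokens: both 0.5 and 0.3 exceed $\alpha\lambda = 0.16$, while 0.15 and 0.05 do not.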

Figures (9)

  • Figure 1: An overview of our TRAPO framework, which synergistically combines two key components: Trust-Region SFT (TrSFT) and Adaptive Expert Guidance. Left: By clipping the gradient weight with a trust-region parameter $\alpha$, TrSFT prevents exploding gradients on low-probability tokens, ensuring a stable learning signal when combined with RL. Right: The adaptive guidance mechanism implements a "learn-while-practicing" loop. When the target model fails a rollout, its cumulative return dynamically dictates the length of an expert prefix provided for guidance. The model then continues generation, and the full trajectory is optimized using both the TrSFT loss on expert prefixes and a standard RL objective on the model rollout.
  • Figure 2: Accuracy and characteristics of Qwen2.5-3B-Instruct reasoning on MATH-500 with different numbers of tokens from DeepSeek-R1 as prefixes.
  • Figure 3: An illustrative experiment showing the training dynamics during SFT. Panel (a) shows snapshots of the learned target policy at four consecutive training phases, corresponding to training steps 0, 50, 100, and 1000, respectively. Panel (b) presents the KL divergence curve along with the change in cumulative probability of the target policy within the void regions.
  • Figure 4: Training dynamics of TRAPO compared with GRPO. From left to right: average reward, generation length, and output entropy during training. For fair comparison, both reward and generation length are computed by excluding trajectories guided by expert prefixes.
  • Figure 5: Average accuracy across five mathematical benchmarks.
  • ...and 4 more figures
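
The Figure 1 caption describes TrSFT as clipping the gradient weight with a trust-region parameter $\alpha$ to avoid exploding gradients on low-probability tokens. One way to picture this is to compare the weight $1/p$ that the standard negative log-likelihood places on a token of probability $p$ with a capped version. The capped form $1/\max(p, \alpha)$ below is an assumption made for illustration, as are the function names; the paper's exact formulation is not reproduced in this summary.

```python
def sft_gradient_weight(p: float) -> float:
    """Magnitude of d(-log p)/dp, which grows without bound as p -> 0."""
    return 1.0 / p


def clipped_gradient_weight(p: float, alpha: float) -> float:
    """Assumed clipped variant: equals 1/p when p >= alpha and is
    capped at 1/alpha for tokens below the trust-region threshold."""
    return 1.0 / max(p, alpha)


# Illustrative comparison (alpha = 0.1 is a hypothetical value).
for token_prob in (0.5, 0.1, 1e-3, 1e-6):
    plain = sft_gradient_weight(token_prob)
    capped = clipped_gradient_weight(token_prob, alpha=0.1)
    print(f"p={token_prob:g}  plain={plain:g}  capped={capped:g}")
```

Under this assumption, tokens the model already assigns probability at least $\alpha$ are trained exactly as in standard SFT, while the update on very unlikely tokens stays bounded by $1/\alpha$.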

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 1
  • proof