Trust-Region Adaptive Policy Optimization

Mingyu Su; Jian Guan; Yuxian Gu; Minlie Huang; Hongning Wang

TL;DR

TRAPO introduces a one-stage post-training paradigm that interleaves Supervised Fine-Tuning and Reinforcement Learning at the instance level. It couples Trust-Region SFT, which stabilizes knowledge absorption by shifting from forward KL to reverse KL, with an adaptive prefix-guidance mechanism and micro-group sampling to balance exploration and imitation. The approach yields consistent improvements across five mathematical reasoning benchmarks and general-domain tasks, outperforming SFT, RL, and SFT-then-RL baselines and extending the model's reasoning capabilities. These results establish TRAPO as a robust paradigm for reasoning-enhanced LLMs with stronger test-time scaling. The framework is supported by theoretical insights into KL-divergence behavior and practical ablations demonstrating the contributions of TrSFT and adaptive guidance.

Abstract

Post-training methods, especially Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), play an important role in improving the complex reasoning abilities of large language models (LLMs). However, the dominant two-stage pipeline (SFT then RL) suffers from a key inconsistency: SFT enforces rigid imitation that suppresses exploration and induces forgetting, which limits how much RL can subsequently improve the model. We address this inconsistency with TRAPO (Trust-Region Adaptive Policy Optimization), a hybrid framework that interleaves SFT and RL within each training instance by optimizing an SFT loss on expert prefixes and an RL loss on the model's own completions, unifying external supervision and self-exploration. To stabilize training, we introduce Trust-Region SFT (TrSFT), which minimizes forward KL divergence inside a trust region but attenuates optimization outside it, effectively shifting toward reverse KL and yielding stable, mode-seeking updates favorable for RL. An adaptive prefix-selection mechanism further allocates expert guidance based on measured utility. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines, as well as recent state-of-the-art approaches, establishing a strong new paradigm for reasoning-enhanced LLMs.
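The contrast between forward and reverse KL that motivates TrSFT can be illustrated on a toy discrete distribution. The sketch below is not the TrSFT objective; it only shows the standard property that forward KL favors a mass-covering fit while reverse KL favors a mode-seeking one. The distributions `target`, `q_cover`, and `q_mode` and the helper `kl` are hypothetical values chosen for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical bimodal "expert" distribution over three outcomes.
target = [0.49, 0.02, 0.49]

# Two candidate model distributions (illustrative assumptions):
q_cover = [1 / 3, 1 / 3, 1 / 3]   # spreads mass over everything
q_mode = [0.90, 0.05, 0.05]       # commits to a single mode

candidates = {"cover": q_cover, "mode": q_mode}

# Forward KL: KL(target || q), the direction minimized by standard SFT.
forward = {name: kl(target, q) for name, q in candidates.items()}

# Reverse KL: KL(q || target), the mode-seeking direction.
reverse = {name: kl(q, target) for name, q in candidates.items()}

print("forward KL:", forward)
print("reverse KL:", reverse)
print("forward KL prefers:", min(forward, key=forward.get))
print("reverse KL prefers:", min(reverse, key=reverse.get))
```

In this toy setting, forward KL scores the mass-covering candidate lower, while reverse KL scores the single-mode candidate lower, which is the sense in which reverse-KL-like updates are described as mode-seeking.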
