QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning

Doyeon Lee; Eunyi Lyou; Hyunsoo Cho; Sookyung Kim; Joonseok Lee; Jaemoo Choi

TL;DR

QUATRO addresses the instability that heuristic trust-region approximations cause in RL-based LLM fine-tuning by introducing a query-adaptive trust-region objective derived via Lagrangian duality. It enforces per-prompt KL constraints, yielding adaptive update magnitudes through per-prompt dual variables and a stabilizing log-ratio term. Empirically, QUATRO outperforms GRPO-style baselines on six mathematical benchmarks, achieving higher Pass@k and Unique Correct Count while preserving entropy and exploration. The approach reduces gradient masking and entropy collapse, enabling robust training under policy staleness and aggressive learning rates, with larger gains at higher sampling budgets. It provides a unified framework linking TRPO, KL regularization, and prompt-conditioned optimization for scalable LLM fine-tuning.
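
To make the per-prompt constraint concrete, the following is a generic sketch of a prompt-wise KL-constrained trust-region problem and its Lagrangian, written in standard TRPO-style notation. It is not the paper's exact objective: the symbols $\pi_{\text{old}}$, $A(q,o)$, and $\lambda_q$, as well as the direction of the KL divergence, are assumptions chosen for illustration.

```latex
\begin{aligned}
\max_{\theta}\quad & \mathbb{E}_{q}\,\mathbb{E}_{o\sim\pi_{\text{old}}(\cdot\mid q)}\!\left[\frac{\pi_{\theta}(o\mid q)}{\pi_{\text{old}}(o\mid q)}\,A(q,o)\right] \\
\text{s.t.}\quad & D_{\mathrm{KL}}\!\left(\pi_{\text{old}}(\cdot\mid q)\,\|\,\pi_{\theta}(\cdot\mid q)\right)\le\delta \quad \text{for every prompt } q, \\[4pt]
\mathcal{L}(\theta,\lambda) \;=\; & \mathbb{E}_{q}\!\left[\mathbb{E}_{o\sim\pi_{\text{old}}(\cdot\mid q)}\!\left[\frac{\pi_{\theta}(o\mid q)}{\pi_{\text{old}}(o\mid q)}\,A(q,o)\right]-\lambda_q\Big(D_{\mathrm{KL}}\!\left(\pi_{\text{old}}(\cdot\mid q)\,\|\,\pi_{\theta}(\cdot\mid q)\right)-\delta\Big)\right],\qquad \lambda_q\ge 0.
\end{aligned}
```

Under this reading, each prompt carries its own multiplier $\lambda_q$, so a prompt whose constraint is tight is penalized more strongly and receives a smaller update than a prompt whose constraint is slack.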

Abstract

GRPO-style reinforcement learning (RL)-based LLM fine-tuning algorithms have recently gained popularity. Because they rely on heuristic trust-region approximations, however, they can exhibit brittle optimization behavior: global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through a principled optimization. This yields a clear and interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with a stabilizer term arising intrinsically from the exact trust-region formulation. Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable training under increased policy staleness and aggressive learning rates, maintaining well-controlled entropy throughout training.
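
The clipping limitation mentioned above can be illustrated with a minimal sketch of the generic PPO-style clipped surrogate. This is not GSPO's or QUATRO's exact objective; the function name `clipped_surrogate_grad` and the default `eps=0.2` are hypothetical choices for illustration. The sketch shows that once a sample's importance ratio leaves the clipping range in the direction favored by its advantage, its gradient is exactly zero, so the objective no longer exerts any control over that sample.

```python
import numpy as np


def clipped_surrogate_grad(ratio, advantage, eps=0.2):
    """Derivative of min(r * A, clip(r, 1 - eps, 1 + eps) * A) with respect to r.

    Hypothetical helper for illustration only (generic PPO-style clipping).
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)

    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage

    # When the unclipped term is the smaller one, the surrogate equals r * A,
    # so its derivative with respect to r is A.
    # Otherwise the clipped term is selected; it is constant in r, so the
    # derivative is 0 and the sample is masked out of the update.
    unclipped_is_active = unclipped <= clipped
    return np.where(unclipped_is_active, advantage, 0.0)


# Five samples: inside the range, and outside it on both sides for both
# advantage signs.
ratios = np.array([1.0, 1.5, 0.5, 1.5, 0.5])
advantages = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
grads = clipped_surrogate_grad(ratios, advantages)
print(grads)
```

In this example the second and third samples (ratio 1.5 with positive advantage, ratio 0.5 with negative advantage) receive zero gradient, which is the masking effect the abstract refers to.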

Paper Structure (41 sections, 3 theorems, 80 equations, 11 figures, 5 tables, 1 algorithm)

Introduction
Trust-Region Policy Optimization
KL-Constrained Trust-Region Optimization
Limitations of Existing Trust-Region Methods
Heuristic Importance-ratio Clipping
Method
Prompt-wise Trust-Region Policy Optimization
Objective Function and Algorithm
Interpretation and Key Observations
Experiments
Experimental Setup
Results and Analysis
Stability Analysis
Related Work
Conclusion
...and 26 more sections

Key Result

Proposition 3.1

Let the function $f_q:\mathbb{R}\rightarrow \mathbb{R}$ be defined as […]. Given that the following holds: […]

Figures (11)

Figure 1: Limitations of importance-ratio clipping of GSPO. Training dynamics of GSPO under increasing learning rates (a) and repeated offline inner-loop updates (b), plotted against training steps.
Figure 2: Prompt-wise update magnitude (left) and training-time evolution of the Lagrange multiplier $\lambda$ under different $\delta$ (right).
Figure 3: Entropy dynamics comparison across training steps. The baseline exhibits a steady entropy collapse, while our method maintains it at a controlled and stable level throughout training.
Figure 4: Robustness to policy staleness and aggressive updates. Both increased offline inner-loop updates (a) and larger learning rates (b) amplify policy mismatch, leading GSPO to unstable rewards and entropy collapse. In contrast, our method remains stable across these settings, enabled by explicit trust-region control and the log-ratio stabilizer (c).
Figure : Prompt-wise update magnitude. Update strength decreases as the number of correct rollouts $c$ increases, with tighter trust regions (smaller $\delta$) yielding more conservative updates.
...and 6 more figures

Theorems & Definitions (5)

Proposition 3.1
Remark 3.2
Theorem 3.3: QUATRO Objective
Remark 3.4: Connection to GSPO
Theorem 3.5
