QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning

Doyeon Lee, Eunyi Lyou, Hyunsoo Cho, Sookyung Kim, Joonseok Lee, Jaemoo Choi

TL;DR

QUATRO addresses instability in RL-based LLM fine-tuning due to heuristic trust-region approximations by introducing a query-adaptive trust-region objective derived via Lagrangian duality. It enforces per-prompt KL constraints, yielding adaptive update magnitudes through per-prompt dual variables and a stabilizing log-ratio term. Empirically, QUATRO outperforms GRPO-style baselines on six mathematical benchmarks, achieving higher Pass@k and Unique Correct Count while preserving entropy and exploration. The approach reduces gradient masking and entropy collapse, enabling robust training under policy staleness and aggressive learning rates, with larger gains at higher sampling budgets. It provides a unified framework linking TRPO, KL-regularization, and prompt-conditioned optimization for scalable LLM fine-tuning.
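To make the role of the per-prompt dual variables concrete, here is a minimal sketch of textbook projected dual ascent on a per-prompt KL constraint. It is not the paper's update rule, which this summary does not state. The function name `dual_ascent_step`, the step size `lr`, and the dictionary layout are all assumptions made for illustration; `delta` stands for the trust-region radius.

```python
def dual_ascent_step(lambdas, kls, delta, lr=0.1):
    """One projected dual-ascent step per prompt (hypothetical helper).

    lambdas: dict mapping prompt id -> current multiplier (>= 0)
    kls:     dict mapping prompt id -> measured KL for that prompt
    delta:   trust-region radius for the constraint KL_q <= delta
    """
    updated = {}
    for q, lam in lambdas.items():
        # The constraint violation (KL_q - delta) is the dual gradient.
        violation = kls[q] - delta
        # Raise the multiplier when the constraint is violated, lower it
        # otherwise, and project back onto lambda >= 0.
        updated[q] = max(0.0, lam + lr * violation)
    return updated


lambdas = {"q1": 1.0, "q2": 0.0}
kls = {"q1": 0.5, "q2": 0.01}
new_lambdas = dual_ascent_step(lambdas, kls, delta=0.1)
```

In this sketch a prompt whose KL exceeds `delta` gets a larger multiplier, which penalizes its next update more strongly, while a prompt well inside the trust region keeps its multiplier at zero.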

Abstract

GRPO-style reinforcement learning (RL)-based LLM fine-tuning algorithms have recently gained popularity. Relying on heuristic trust-region approximations, however, they can lead to brittle optimization behavior, as global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through a principled optimization. This yields a clear and interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with a stabilizer term arising intrinsically from the exact trust-region formulation. Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable training under increased policy staleness and aggressive learning rates, maintaining well-controlled entropy throughout training.
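The clipping issue described above can be illustrated with the standard PPO-style clipped surrogate, used here as a stand-in. The exact GRPO or GSPO objectives are not given in this summary, and the function names and the default `eps=0.2` are assumptions. The sketch differentiates the surrogate numerically with respect to the importance ratio and shows that the gradient vanishes once a sample is clipped.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single sample."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def surrogate_grad(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference derivative w.r.t. the importance ratio."""
    upper = clipped_surrogate(ratio + h, advantage, eps)
    lower = clipped_surrogate(ratio - h, advantage, eps)
    return (upper - lower) / (2.0 * h)


# Inside the clipping range the sample contributes a gradient.
g_inside = surrogate_grad(1.0, advantage=1.0)
# Above the range with a positive advantage the gradient is masked.
g_above = surrogate_grad(1.5, advantage=1.0)
# Below the range with a negative advantage the gradient is masked too.
g_below = surrogate_grad(0.5, advantage=-1.0)
```

Once a sample is masked this way, the objective no longer exerts any pull on it, however far its ratio has drifted from 1.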

Paper Structure (41 sections, 3 theorems, 80 equations, 11 figures, 5 tables, 1 algorithm)

Key Result

Proposition 3.1

Let function $f_q:\mathbb{R}\rightarrow \mathbb{R}$ be defined as Given that the following holds:

Figures (11)

  • Figure 1: Limitations of importance-ratio clipping in GSPO. Training dynamics of GSPO under increasing learning rates (a) and repeated offline inner-loop updates (b), plotted against training steps.
  • Figure 2: Prompt-wise update magnitude (left) and training-time evolution of the Lagrange multiplier $\lambda$ under different $\delta$ (right).
  • Figure 3: Entropy dynamics comparison across training steps. The baseline exhibits a steady entropy collapse, while our method maintains it at a controlled and stable level throughout training.
  • Figure 4: Robustness to policy staleness and aggressive updates. Both increased offline inner-loop updates (a) and larger learning rates (b) amplify policy mismatch, leading GSPO to unstable rewards and entropy collapse. In contrast, our method remains stable across these settings, enabled by explicit trust-region control and the log-ratio stabilizer (c).
  • Figure : Prompt-wise update magnitude. Update strength decreases as the number of correct rollouts $c$ increases, with tighter trust regions (smaller $\delta$) yielding more conservative updates.
  • ...and 6 more figures

Theorems & Definitions (5)

  • Proposition 3.1
  • Remark 3.2
  • Theorem 3.3: QUATRO Objective
  • Remark 3.4: Connection to GSPO
  • Theorem 3.5