QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning
Doyeon Lee, Eunyi Lyou, Hyunsoo Cho, Sookyung Kim, Joonseok Lee, Jaemoo Choi
TL;DR
QUATRO addresses the instability that heuristic trust-region approximations cause in RL-based LLM fine-tuning by introducing a query-adaptive trust-region objective derived via Lagrangian duality. It enforces per-prompt KL constraints, so update magnitudes adapt through per-prompt dual variables and a stabilizing log-ratio term. Empirically, QUATRO outperforms GRPO-style baselines on six mathematical benchmarks, achieving higher Pass@k and Unique Correct Count while preserving entropy and exploration. The approach reduces gradient masking and entropy collapse, enabling robust training under policy staleness and aggressive learning rates, with larger gains at higher sampling budgets. It provides a unified framework linking TRPO, KL regularization, and prompt-conditioned optimization for scalable LLM fine-tuning.
Abstract
GRPO-style reinforcement learning (RL) algorithms for LLM fine-tuning have recently gained popularity. Because they rely on heuristic trust-region approximations, however, they can exhibit brittle optimization behavior: global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through principled optimization. This yields a clear and interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with a stabilizer term arising intrinsically from the exact trust-region formulation. Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable training under increased policy staleness and aggressive learning rates, maintaining well-controlled entropy throughout training.
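To make the clipping failure mode concrete, the sketch below computes the per-sample gradient of the standard PPO/GRPO-style clipped surrogate, min(r * A, clip(r, 1 - eps, 1 + eps) * A), with respect to the importance ratio r. It illustrates the baseline behavior only and is not QUATRO's objective, which this section does not spell out. The function name `clipped_surrogate_grad` and the default `eps=0.2` are assumptions chosen for illustration.

```python
def clipped_surrogate_grad(ratio, advantage, eps=0.2):
    """Gradient w.r.t. `ratio` of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).

    Hypothetical helper for illustration; `eps` is an assumed clipping range.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    unclipped_term = ratio * advantage
    clipped_term = clipped * advantage
    # If the unclipped term is the active minimum, the gradient is A.
    # Otherwise the clipped (constant) term is active and the gradient is 0.
    if unclipped_term <= clipped_term:
        return advantage
    return 0.0


if __name__ == "__main__":
    # (ratio, advantage) pairs: inside the range, and outside on either side.
    samples = [(1.0, 1.0), (1.5, 1.0), (0.5, -1.0), (0.5, 1.0), (1.5, -1.0)]
    for r, a in samples:
        print(f"ratio={r:.1f} advantage={a:+.1f} grad={clipped_surrogate_grad(r, a):+.1f}")
```

Samples with a positive advantage and a ratio above 1 + eps, or a negative advantage and a ratio below 1 - eps, receive zero gradient. Once a sample leaves the clipping range in that direction, the clipped objective no longer exerts any control over it.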
