TROLL: Trust Regions improve Reinforcement Learning for Large Language Models

Philipp Becker, Niklas Freymuth, Serge Thilges, Fabian Otto, Gerhard Neumann

TL;DR

TROLL replaces the PPO-like clip objective used in reward-based fine-tuning of LLMs with a discrete, differentiable trust region projection that enforces token-level KL constraints. The projection operates on a sparse subset of the most important token logits, acts as a direct replacement for clipping during training, and leaves inference behavior unchanged. Across datasets, model families, and advantage-estimation methods, it outperforms PPO-like clipping in training speed, stability, and final success rates.

Abstract

On-policy Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language Models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model's inference behavior. Across datasets, model families, and advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.
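One way to see why clipping is only a rough proxy for a KL trust region is a toy comparison. The sketch below uses the standard PPO clip surrogate and a plain categorical KL; the 3-token distributions, the sampled token index, and the KL direction (new relative to old) are illustrative assumptions, not values or conventions taken from the paper.

```python
import math

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard per-token PPO clip surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def categorical_kl(p, q):
    """KL(p || q) for two categorical distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical 3-token vocabulary (cat, troll, hamster); numbers are made up.
old = [0.2, 0.6, 0.2]
sampled = 1  # assume the sampled token was "troll"

new_close = [0.2, 0.6, 0.2]    # identical to the old policy
new_far = [0.39, 0.6, 0.01]    # same probability on the sampled token only

for new in (new_close, new_far):
    ratio = new[sampled] / old[sampled]
    print(ratio, ppo_clip_objective(ratio, 1.0), categorical_kl(new, old))

# Beyond the clip range the surrogate is flat in the ratio (for A > 0),
# so it carries no gradient signal there.
flat_a = ppo_clip_objective(1.5, 1.0)
flat_b = ppo_clip_objective(2.0, 1.0)
```

Both candidate policies give the same ratio on the sampled token, so the clip surrogate cannot tell them apart, while their KL to the old policy differs. A token-level KL constraint looks at the whole (here, tiny) distribution rather than a single ratio.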

Paper Structure

This paper contains 30 sections, 2 theorems, 37 equations, 14 figures, 5 tables.

Key Result

Theorem A.1

For any pair of logits $o_t^{(1)}$ and $o_t^{(2)}$, with $\tilde{\pi}(o_t^{(1)} \, | \, \bm{q}, \bm{o}_{<t}) \ge \tilde{\pi}(o_t^{(2)} \, | \, \bm{q}, \bm{o}_{<t})$ w.l.o.g., the logit-wise terms that sum to the KL are equally ordered iff $e^\kappa \ge \gamma$, where $\kappa = \frac{ \tilde{\pi}(o_t^{(1)} \, | \, \bm{q}, \bm{o}_{<t} ) }{ \tilde{\pi}(o_t^{(2)} \, | \, \bm{q}, \bm{o}_{<t} ) }$ is t

Figures (14)

  • Figure 1: TROLL overview. Left: Example of a 3-token distribution (cat, troll, hamster). The old policy (red) favors the troll, while the new policy (blue) shifts toward the hamster. The projection (green) ensures that the updated policy stays within the trust region (black). Right: This projection yields clear performance gains over PPO clipping on our MATH-Eval suite (see the experiments section), as shown here for Qwen3-14B trained with GRPO.
  • Figure 2: During training, we maintain sparse token probabilities for the generated sequences. For a given update step, the new logit distribution is similarly sparsified and then compared to the old distribution. If the KL divergence between these distributions is too large, the new distribution is projected back onto a trust region of the old distribution. As opposed to PPO-like clipping, this projection ensures similarity between the old and new policy while preserving gradients.
  • Figure 3: Comparison of TROLL (full lines) and Clip (dashed lines) across -trained Qwen3 models with $600$M to $14$B parameters. Full-opacity lines mark smoothed results, while the background shows original values. TROLL consistently boosts training efficiency and final success rates (left), which translates to in-distribution questions (middle) and out-of-domain test data (right).
  • Figure 4: Left: Final evaluations for TROLL and Clip for different combinations of models and datasets trained with . The better approach between TROLL and Clip is marked in blue. Right: Comparison of TROLL (full lines) and the Clip objective (dashed lines) for different models trained with GRPO. TROLL generally improves over Clip and performs well across all considered datasets. In particular, TROLL leads to significantly faster learning for different Llama models, where Clip often takes significantly more iterations to obtain a positive training signal. TROLL also shows more stable performance than Clip throughout training.
  • Figure 5: Left: Qwen3-$1.7$B trained with GRPO using the TROLL projection, compared across different hyperparameter choices. TROLL works well for conservative bounds $\epsilon$ and top-$K$ logit selections, but is slower for overly conservative values and degrades slightly for overly aggressive updates or token pruning. Top Right: Memory and runtime comparison between TROLL and Clip in a controlled environment. TROLL imposes a modest overhead compared to the cost of training the parameters. Bottom Right: TROLL generally maintains more entropy during training while showing higher success rates than Clip, as shown for Qwen3-$14$B.
  • ...and 9 more figures
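The pipeline described in the Figure 2 caption (sparsify, compare, project if needed) can be sketched roughly as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the sparsification keeps the old policy's top-K tokens and pools the remaining mass into one bucket, the KL is taken from the new to the old distribution, and the projection is a stand-in that bisects along a geometric interpolation between the two distributions. It also uses plain Python, so it does not demonstrate the gradient-preserving property of the actual projection. All function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sparsify(probs, keep):
    """Keep probabilities at indices `keep`; pool the rest into one bucket (assumption)."""
    kept = [probs[i] for i in keep]
    rest = max(1.0 - sum(kept), 1e-12)  # floor keeps the logarithms finite
    return kept + [rest]

def kl(p, q):
    """KL(p || q) for categorical distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def interpolate(old, new, alpha):
    """Normalized geometric mix old^(1-alpha) * new^alpha."""
    w = [math.exp((1.0 - alpha) * math.log(o) + alpha * math.log(n))
         for o, n in zip(old, new)]
    z = sum(w)
    return [x / z for x in w]

def project(old, new, eps, iters=50):
    """Stand-in projection: return `new` if it is inside the trust region,
    otherwise bisect for the largest feasible step toward it."""
    if kl(new, old) <= eps:
        return new
    lo, hi = 0.0, 1.0  # alpha = 0 is the old policy and is always feasible
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(interpolate(old, new, mid), old) <= eps:
            lo = mid
        else:
            hi = mid
    return interpolate(old, new, lo)

# Made-up logits over a 5-token vocabulary.
old_p = softmax([1.0, 2.5, 1.0, -1.0, -2.0])
new_p = softmax([0.5, 1.0, 3.0, -1.0, -2.0])
top_k = sorted(range(len(old_p)), key=lambda i: old_p[i], reverse=True)[:3]

old_s = sparsify(old_p, top_k)
new_s = sparsify(new_p, top_k)
proj = project(old_s, new_s, eps=0.05)
print(kl(new_s, old_s), kl(proj, old_s))
```

In this toy setting the unprojected update violates the bound, and the projected distribution sits at (or just inside) the boundary of the trust region while still moving toward the new policy.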

Theorems & Definitions (6)

  • Theorem A.1
  • proof
  • Definition A.1
  • Theorem A.2
  • proof
  • proof