TROLL: Trust Regions improve Reinforcement Learning for Large Language Models

Philipp Becker; Niklas Freymuth; Serge Thilges; Fabian Otto; Gerhard Neumann

TL;DR

TROLL replaces the PPO-like clip objective used for reward-based fine-tuning of LLMs with a discrete differentiable trust region projection that enforces token-level KL constraints on a sparse subset of the model's most important token logits. It serves as a direct replacement for clipping during training, leaves inference behavior unchanged, and outperforms PPO-like clipping in training speed, stability, and final success rates across datasets, model families, and advantage-estimation methods.

Abstract

On-policy Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language Models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model's inference behavior. Across datasets, model families, and advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.
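To make the contrast in the abstract concrete, the sketch below computes the two quantities it compares: the standard PPO clipped surrogate for one token, and a token-level KL divergence between the new and old next-token distributions. It is an illustration under stated assumptions, not the TROLL projection itself. The function names, the KL direction (new relative to old), the clip range `eps=0.2`, and the bound `0.05` are all hypothetical choices made for this example, and the sparse token selection and the differentiable projection step described in the abstract are not implemented here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one token position."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for a single token.

    ratio is pi_new(token) / pi_old(token); eps is an assumed clip range.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def token_kl(new_logits, old_logits):
    """KL(pi_new || pi_old) over the vocabulary at one token position.

    The direction of the KL is an assumption made for this sketch.
    """
    log_p = log_softmax(new_logits)
    log_q = log_softmax(old_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def violates_trust_region(new_logits, old_logits, bound=0.05):
    """True if the token-level KL exceeds an (assumed) bound."""
    return token_kl(new_logits, old_logits) > bound
```

The clipped surrogate only looks at the probability ratio of the single sampled token, whereas the KL term measures how far the whole next-token distribution has moved. A trust region method constrains the latter; how TROLL projects a violating distribution back inside the bound is specified in the paper, not here.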
