The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL

Yingru Li; Jiawei Xu; Ziniu Li; Jiacai Liu; Wei Liu; Yuxuan Tong; Longtao Zheng; Zhenghai Xue; Yaxiang Zhang; Tianle Cai; Ge Zhang; Qian Liu; Baoxiang Wang

TL;DR

The paper tackles training collapse caused by exploding gradient variance in long-horizon RL for LLMs. It derives the Optimal Token Baseline (OTB), a causal, token-level, variance-minimizing baseline that weights by realized gradient energy. A Logit-Gradient Proxy estimates that token-level energy from forward-pass probabilities alone, so the baseline needs no additional backward pass. Theoretical results establish unbiasedness and variance reduction. Empirically, OTB trains stably and, with a group size of $N=4$, matches the performance of $N=32$, cutting token consumption by over 65% across single-turn and multi-turn (tool-integrated) reasoning; the results extend to longer contexts and larger models. The authors position the method as a step toward stable, scalable RL training of LLMs on demanding reasoning tasks, with possible extensions to search and autonomous agents.

Abstract

Reinforcement Learning (RL) for Large Language Models (LLMs) often suffers from training collapse in long-horizon tasks due to exploding gradient variance. To mitigate this, a baseline is commonly introduced for advantage computation; however, traditional value models remain difficult to optimize, and standard group-based baselines overlook sequence heterogeneity. Although classic optimal baseline theory can achieve global variance reduction, it neglects token heterogeneity and requires prohibitive gradient-based computation. In this work, we derive the Optimal Token Baseline (OTB) from first principles, proving that gradient updates should be weighted inversely to their cumulative gradient norm. To ensure efficiency, we propose the Logit-Gradient Proxy that approximates the gradient norm using only forward-pass probabilities. Our method achieves training stability and matches the performance of large group sizes ($N=32$) with only $N=4$, reducing token consumption by over 65% across single-turn and tool-integrated reasoning tasks.
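The two ingredients named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `logit_grad_energy` and `energy_weighted_baseline` are hypothetical. The per-token energy uses the standard identity that the gradient of $\log \pi(a)$ with respect to the logits is $\mathrm{onehot}(a) - p$, so its squared norm is $1 - 2p_a + \sum_v p_v^2$. How per-token energies are accumulated into the paper's realized energy is assumed here to be a running sum along the sequence.

```python
import numpy as np

def logit_grad_energy(probs, actions):
    """Squared norm of d log pi(a) / d logits, i.e. ||onehot(a) - p||^2.

    probs:   (N, T, V) softmax probabilities from the forward pass
    actions: (N, T) integer ids of the sampled tokens
    Uses probabilities only; no backward pass is involved.
    """
    p_a = np.take_along_axis(probs, actions[..., None], axis=-1)[..., 0]
    return 1.0 - 2.0 * p_a + np.sum(probs ** 2, axis=-1)

def energy_weighted_baseline(reward_to_go, energy, mask, eps=1e-8):
    """Per-step baseline: energy-weighted mean of reward-to-go over the group.

    All inputs have shape (N, T); the result has shape (T,).
    """
    w = energy * mask  # padded positions get zero weight
    return (w * reward_to_go).sum(axis=0) / (w.sum(axis=0) + eps)

# Toy group: N sequences, T steps, vocabulary of size V (all values synthetic).
rng = np.random.default_rng(0)
N, T, V = 4, 6, 10
logits = rng.normal(size=(N, T, V))
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
actions = rng.integers(0, V, size=(N, T))
mask = np.ones((N, T))
# Outcome reward (0 or 1) per sequence, broadcast to every step as reward-to-go.
reward_to_go = np.repeat(rng.integers(0, 2, size=(N, 1)).astype(float), T, axis=1)

per_token = logit_grad_energy(probs, actions)
# ASSUMPTION: realized energy is approximated by a running sum of per-token energies.
energy = np.cumsum(per_token * mask, axis=1)
baseline = energy_weighted_baseline(reward_to_go, energy, mask)
advantages = (reward_to_go - baseline[None, :]) * mask
```

Because the weights are non-negative, each baseline value is a convex combination of the group's rewards-to-go at that step, so sequences with larger energy pull the baseline toward their own return.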

Paper Structure (44 sections, 3 theorems, 65 equations, 20 figures, 2 tables)

Introduction
Related Works
Gradient Variance for Training Instability
The Optimal Token Baseline
Logit-Gradient Proxy
Theoretical Analysis
Unbiasedness of the OTB
Variance Reduction of the OTB
Justification for Logit-Gradient Proxy
Empirical Studies
Comparative Results on Performance Metrics
Eliminating Training Collapse
Breaking the Sample Efficiency Barrier
The Logit-Gradient Proxy Matters
Robustness under Longer Contexts
...and 29 more sections

Key Result

Theorem 4.1

The Optimal Token Baseline (OTB) at step $t$ is the weighted centroid of the reward-to-go, weighted by the realized energy:
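Read as a formula, a weighted centroid of this kind has the form sketched here. The symbols are assumptions chosen for illustration and may differ from the paper's notation: $G_t$ is the reward-to-go from step $t$, $W_t$ is the realized energy at step $t$, and $B_t^{\star}$ is the baseline.

$$
B_t^{\star} \;=\; \frac{\mathbb{E}\left[\, W_t \, G_t \,\right]}{\mathbb{E}\left[\, W_t \,\right]}
$$

With a group of $N$ sampled sequences indexed by $i$, the natural plug-in estimate would be

$$
\hat{B}_t \;=\; \frac{\sum_{i=1}^{N} W_t^{(i)} \, G_t^{(i)}}{\sum_{i=1}^{N} W_t^{(i)}} .
$$

When every $W_t^{(i)}$ is equal, this reduces to the plain group mean of the reward-to-go, which is the sense in which uniform group baselines ignore heterogeneity across sequences and tokens.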

Figures (20)

Figure 1: Gradient Norm and AIME25 Score under Single-Turn Reasoning. We adopt fully on-policy training on Qwen3-8B-Base. The vertical dotted line marks the point at which the compared methods collapse, coinciding with a sudden surge in gradient norm. Notably, our Optimal Token Baseline keeps the gradient norm stable, resulting in stable training and a higher score.
Figure 2: High gradient variance triggers a sudden surge in the gradient norm, leading to an eventual training collapse. The calculation of gradient variance is described in the paper's appendix.
Figure 3: Different sequences in the same group exhibit distinct energy. The calculation of total energy is provided in the paper's appendix.
Figure 4: Individual tokens contribute varying energy. Within a single generation step $t$, sequences exhibit distinct energy profiles. For instance, at $t=100$, the realized energy ranks from lowest to highest as Seq $10 < 4 < 13 < 14$. However, this ranking shifts significantly by $t=600$: the order becomes Seq $14 < 4 < 10 < 13$, with further re-ranking occurring by $t=1000$. The calculation of realized energy is provided in the Logit-Gradient Proxy section.
Figure 5: The relationship between Token Probability (Model Confidence) and Logit Gradient Norm (Uncertainty Measure).
...and 15 more figures

Theorems & Definitions (6)

Definition 3.1: Realized Energy
Theorem 4.1: Optimal Token Baseline
Proposition 4.2: Logit-Gradient Proxy
Remark 3.1: Role of $B_k$ Terms
Proposition 3.3: Convex Decomposition
proof
