LAD: Learning Advantage Distribution for Reasoning

Wendi Li; Sharon Li

TL;DR

Learning Advantage Distributions (LAD) is a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. Its gradient update increases the likelihood of high-advantage responses while suppressing over-confident probability growth, which prevents collapse without auxiliary entropy regularization.

Abstract

Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an $f$-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse without requiring auxiliary entropy regularization. LAD incurs no extra training cost compared to GRPO and scales naturally to LLM post-training. In a controlled bandit setting, LAD faithfully recovers the multimodal advantage distribution, validating the theoretical formulation. Experiments on math and code reasoning tasks across several LLM backbones show that LAD reliably improves both accuracy and generative diversity.
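The distribution-matching idea in the abstract can be illustrated on a toy bandit. The sketch below is not the paper's practical LAD loss. It assumes, for illustration only, that the advantage-induced target is a softmax of the advantages with temperature `eta` (suggested by the Lagrange multiplier $\eta$ in Lemma 3.1, not quoted from the paper), and it matches a softmax policy to that target by plain gradient descent on the squared Hellinger distance, one of the divergences named in Figure 2. The arm count, advantage values, learning rate, and all function names are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def advantage_distribution(advantages, eta):
    """Assumed target: softmax of advantages with temperature eta."""
    return softmax(np.asarray(advantages, dtype=float) / eta)


def hellinger_sq(p, q):
    """Unnormalized squared Hellinger distance: sum (sqrt p - sqrt q)^2."""
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


def hellinger_grad_logits(logits, q):
    """Gradient of hellinger_sq(softmax(logits), q) w.r.t. the logits.

    With BC = sum_i sqrt(p_i q_i), the chain rule through the softmax
    gives grad_j = p_j * BC - sqrt(p_j q_j).
    """
    p = softmax(logits)
    bc = np.sum(np.sqrt(p * q))
    return p * bc - np.sqrt(p * q)


# Hypothetical 6-arm bandit with two equally good arms (a bimodal target).
advantages = [1.0, -1.0, -0.5, 1.0, -1.0, 0.5]
target = advantage_distribution(advantages, eta=1.0)

logits = np.zeros(len(advantages))  # start from the uniform policy
for _ in range(2000):
    logits -= 2.0 * hellinger_grad_logits(logits, target)

policy = softmax(logits)
print("target:", np.round(target, 4))
print("policy:", np.round(policy, 4))
print("squared Hellinger distance:", hellinger_sq(policy, target))
```

Because the objective is minimized only when the policy equals the target, both high-advantage arms keep their share of probability mass instead of one arm absorbing it all, which is the behavior the abstract contrasts with pure reward maximization.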

Paper Structure (55 sections, 3 theorems, 25 equations, 6 figures, 8 tables)

Introduction
Preliminary
Reward Maximization in RL for LLMs.
Trust-Region Constrained Policy Optimization.
Method
Learn Advantage Distribution with $f$-divergence
LAD: Minimizing the divergence between $\mathcal{P}_{\pi_{\theta}}$ and $\mathcal{P}_A$.
Practical Implementations
Interpretation.
Analyses and Discussions
$\mathcal{L}_\mathrm{LAD}$ yields desirable loss behavior with implicit regularization.
$\mathcal{L}_\mathrm{LAD}$ is a close surrogate to the theoretical LAD objective $\mathcal{L}_\mathrm{LAD}^\mathrm{theorem}$.
Experiments
Controlled Experiments
Settings.
...and 40 more sections

Key Result

Lemma 3.1

Let $\pi_\theta$ denote the current policy parameterized by $\theta$, and let $\pi_{\mathrm{old}}$ be the behavior model. Suppose that $\pi_\theta$ is optimized using a trust-region constrained reinforcement learning algorithm (e.g., PPO) as in the trust-region constrained objective. There exists a Lagrange multiplier $\eta$ … where $\eta$ is a Lagrangian multiplier, $Z_\pi(x) = \sum_y \frac{\pi_\theta^*(y|x)}{\pi_\mathrm{old}\cdots}$ …

Figures (6)

Figure 1: Advantage distribution and policy distribution on 50-arm bandit trained via different training objectives.
Figure 2: Learning dynamics of GRPO, the theoretical LAD loss, and the practical LAD loss. The loss landscape is constructed with the practical LAD loss. Top: LAD loss under Hellinger distance. Bottom: LAD loss under Jensen–Shannon divergence.
Figure 3: Average Avg@16 and Pass@16 scores over six mathematical benchmarks with different values of the hyperparameter $\eta$.
Figure 4: The average math reasoning performance of LAD loss variants under different $f$-divergence classes. The dashed lines represent the average performances of the strict and non-strict divergence classes.
Figure 5: The distribution of GPU time per token throughout the training process.
...and 1 more figure

Theorems & Definitions (7)

Lemma 3.1: Distribution equivalence
Definition 3.1: $f$-divergence
Lemma 3.2: Linking loss formulations to their induced optimal policies.
proof
proof
Theorem B.1: LAD is an $\mathcal{O}(\delta)$-accurate surrogate of the theoretical LAD objective
proof
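
For reference, the standard textbook definition of an $f$-divergence (Definition 3.1 in the list above) is sketched here. The paper's own notation and argument order may differ; $P$ and $Q$ are generic discrete distributions over responses $y$, not the paper's symbols. For a convex function $f$ with $f(1) = 0$:

$$
D_f(P \,\|\, Q) = \sum_y Q(y)\, f\!\left(\frac{P(y)}{Q(y)}\right).
$$

The two divergences named in Figure 2 correspond to the following generators:

$$
f_{\mathrm{Hellinger}}(t) = \left(\sqrt{t} - 1\right)^2
\quad\Longrightarrow\quad
D_f(P \,\|\, Q) = \sum_y \left(\sqrt{P(y)} - \sqrt{Q(y)}\right)^2,
$$

$$
f_{\mathrm{JS}}(t) = \tfrac{1}{2}\left[\, t \log t - (t + 1) \log \frac{t + 1}{2} \,\right]
\quad\Longrightarrow\quad
D_f(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}\!\left(P \,\Big\|\, \tfrac{P+Q}{2}\right) + \tfrac{1}{2}\,\mathrm{KL}\!\left(Q \,\Big\|\, \tfrac{P+Q}{2}\right).
$$

The Hellinger expression is the unnormalized squared distance; some references include an extra factor of $\tfrac{1}{2}$.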
