LAD: Learning Advantage Distribution for Reasoning

Wendi Li, Sharon Li

TL;DR

The paper introduces Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. The resulting gradient update increases likelihood for high-advantage responses while suppressing over-confident probability growth, which prevents collapse without auxiliary entropy regularization.

Abstract

Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an $f$-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse without requiring auxiliary entropy regularization. LAD incurs no extra training cost compared to GRPO and scales naturally to LLM post-training. In a controlled bandit setting, LAD faithfully recovers the multimodal advantage distribution, validating the theoretical formulation. Experiments on math and code reasoning tasks across several LLM backbones show that LAD reliably improves both accuracy and generative diversity.
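
As a rough illustration of the distribution-matching idea described in the abstract, the sketch below compares toy bandit policies against a target distribution using the two divergences named in the Figure 2 caption (Hellinger and Jensen-Shannon). The target is assumed here to be a softmax of advantages scaled by a temperature `eta`; that form, the advantage values, and all function names are illustrative assumptions, not the paper's definitions or its training objective.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def squared_hellinger(p, q):
    """Squared Hellinger distance, using the 1/2 * sum((sqrt p - sqrt q)^2) convention."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in nats (bounded above by log 2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy 5-arm bandit with two equally good arms (hypothetical advantages).
advantages = np.array([1.0, -1.0, 1.0, -0.5, -0.5])
eta = 0.5  # hypothetical temperature; assumed form of the target below
target = softmax(advantages / eta)

# A policy that has collapsed onto arm 0 versus one that matches the target.
collapsed = softmax(np.array([10.0, 0.0, 0.0, 0.0, 0.0]))
matched = target.copy()

for name, policy in [("collapsed", collapsed), ("matched", matched)]:
    print(name,
          "H^2 =", round(float(squared_hellinger(target, policy)), 4),
          "JS =", round(float(jensen_shannon(target, policy)), 4))
```

Under these assumptions, the collapsed policy sits at a strictly positive divergence from the two-mode target, while a policy equal to the target gives zero under both measures. This is the sense in which matching a distribution, rather than maximizing a scalar, keeps probability mass on every high-advantage arm.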

Paper Structure

This paper contains 55 sections, 3 theorems, 25 equations, 6 figures, 8 tables.

Key Result

Lemma 3.1

Let $\pi_\theta$ denote the current policy parameterized by $\theta$, and let $\pi_{\mathrm{old}}$ be the behavior model. Suppose that $\pi_\theta$ is optimized using a trust-region constrained reinforcement learning algorithm (e.g., PPO) as in the trust-region objective (Eq. eq:trpo). There exists a Lagrange multiplier $\eta$ [...] where $\eta$ is a Lagrangian multiplier, $Z_\pi(x) = \sum_y \frac{\pi_\theta^*(y|x)}{\pi_{\mathrm{old}}(y|x)}$ [...]

Figures (6)

  • Figure 1: Advantage distribution and policy distribution on 50-arm bandit trained via different training objectives.
  • Figure 2: Learning dynamics of GRPO, the theoretical LAD loss, and the practical LAD loss. The loss landscape is constructed with the practical LAD loss. Top: LAD loss under Hellinger distance. Bottom: LAD loss under Jensen–Shannon divergence.
  • Figure 3: Average Avg@16 and Pass@16 scores over six mathematical benchmarks with different values of the hyperparameter $\eta$.
  • Figure 4: The average math reasoning performance of LAD loss variants under different $f$-divergence classes. The dashed lines represent the average performances of the strict and non-strict divergence classes.
  • Figure 5: The distribution of GPU time per token throughout the training process.
  • ...and 1 more figure

Theorems & Definitions (7)

  • Lemma 3.1: Distribution equivalence
  • Definition 3.1: $f$-divergence
  • Lemma 3.2: Linking loss formulations to their induced optimal policies.
  • proof
  • proof
  • Theorem B.1: LAD is an $\mathcal{O}(\delta)$-accurate surrogate of the theoretical LAD objective
  • proof