Delightful Policy Gradient

Ian Osband

Abstract

Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g. one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the Delightful Policy Gradient (DG), which gates each term with a sigmoid of delight, the product of advantage and action surprisal (negative log-probability). For $K$-armed bandits, DG provably improves directional accuracy in a single context and, across multiple contexts, shifts the expected gradient strictly closer to the supervised cross-entropy oracle. This second effect is not variance reduction: it persists even with infinite samples. Empirically, DG outperforms REINFORCE, PPO, and advantage-weighted baselines across MNIST, transformer sequence modeling, and continuous control, with larger gains on harder tasks.
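
The gating described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's reference implementation: the function name `dg_coefficients`, the toy advantages and probabilities, and the way the temperature `eta` enters (dividing the delight before the sigmoid) are assumptions made here for illustration only.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


def dg_coefficients(advantage, log_prob, eta=1.0):
    """Return (gate, coefficient) for each sampled action.

    delight = advantage * surprisal, with surprisal = -log pi(a | context).
    gate    = sigmoid(delight / eta)   # eta placement is an assumption
    coef    = gate * advantage         # multiplies grad log pi(a | context)
    """
    advantage = np.asarray(advantage, dtype=float)
    surprisal = -np.asarray(log_prob, dtype=float)
    delight = advantage * surprisal
    gate = sigmoid(delight / eta)
    return gate, gate * advantage


# Toy batch (hypothetical values): rare success, common success,
# rare failure, common failure.
advantage = np.array([+1.0, +1.0, -1.0, -1.0])
prob = np.array([0.01, 0.90, 0.01, 0.90])

gate, coef = dg_coefficients(advantage, np.log(prob))
for a, p, g, c in zip(advantage, prob, gate, coef):
    print(f"advantage={a:+.0f}  prob={p:.2f}  gate={g:.3f}  coefficient={c:+.3f}")
```

Under these assumptions the rare success keeps nearly its full weight, the rare failure is gated almost to zero, and the two high-probability actions receive gates near one half. A plain policy gradient would give all four actions a coefficient of magnitude 1, regardless of probability.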

Paper Structure (59 sections, 7 theorems, 22 equations, 28 figures, 2 algorithms)

Key Result

Proposition 1

In the bandit above, for any $\eta > 0$: (i) DG preserves the expected gradient direction: $\mathbb{E}[g_{\mathrm{DG}}] = s \cdot g^*_{\mathrm{PG}}$, where $s = (1{-}b)\,w_+ + b\,w_- > 0$. (ii) DG reduces perpendicular variance by a factor of exactly $w_-^2$: $\mathrm{Var}_\perp(g_{\mathrm{DG}}) = w_-^2 \cdot \mathrm{Var}_\perp(g_{\mathrm{PG}})$. (iii) DG always reduces the alignment gap; both methods converge to cosine $1$ as $B \to \infty$.

Figures (28)

  • Figure 1: Effective coefficient $\omega = w \cdot U$ weighting $\nabla_\theta \log \pi$. DG amplifies breakthroughs (rare successes) and suppresses blunders (rare failures); PG (dashed) is probability-blind.
  • Figure 2: MNIST classification error. Supervised CE requires labels; PG and DG do not.
  • Figure 3: Classification error at $T = 10\text{k}$ vs. samples per image $S$, faceted by baseline. Dashed line: error from the exact PG-oracle gradient $g^*_{\mathrm{PG}}$, PG's best achievable direction.
  • Figure 4: Gradient misalignment over training at $S = 1$ (solid) and $S = 100$ (dashed). (a) DG's advantage diminishes with $S$ (variance). (b) DG's advantage persists at $S = 100$ (directional).
  • Figure 5: Symmetric bandit, normalized steps ($K{=}100$, $B{=}100$, $\alpha{=}0.1$).
  • ...and 23 more figures

Theorems & Definitions (11)

  • Proposition 1: Variance reduction in symmetric bandits
  • Lemma 1: Greedy directions under normalized steps
  • Proposition 2: Directional improvement toward cross-entropy
  • Proposition 3: Cosine controls progress
  • proof
  • Lemma 2: Symmetry identity
  • proof
  • Lemma 3: Two-vector cosine monotonicity
  • proof
  • Proposition 4: Directional improvement, general $N$
  • ...and 1 more