Delightful Policy Gradient

Ian Osband

Abstract

Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g. one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the \textit{Delightful Policy Gradient} (DG), which gates each term with a sigmoid of \emph{delight}, the product of advantage and action surprisal (negative log-probability). For $K$-armed bandits, DG provably improves directional accuracy in a single context and, across multiple contexts, shifts the expected gradient strictly closer to the supervised cross-entropy oracle. This second effect is not variance reduction: it persists even with infinite samples. Empirically, DG outperforms REINFORCE, PPO, and advantage-weighted baselines across MNIST, transformer sequence modeling, and continuous control, with larger gains on harder tasks.
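The gate described in the abstract can be sketched for a single sampled action as follows. This is a minimal illustration, not the paper's implementation: the function names, the scalar interface, and the use of $\eta$ as a temperature dividing the delight are assumptions made for the example. The abstract states only that each term is gated by a sigmoid of delight, the product of advantage and surprisal.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dg_coefficient(advantage: float, prob: float, eta: float = 1.0) -> float:
    """Coefficient multiplying grad log pi(a) under a delight gate.

    Assumption: the gate is sigmoid(delight / eta), with eta > 0 acting
    as a temperature. `prob` is the policy probability of the sampled action.
    """
    surprisal = -math.log(prob)          # action surprisal, -log pi(a)
    delight = advantage * surprisal      # delight = advantage * surprisal
    gate = sigmoid(delight / eta)        # gate value in (0, 1)
    return gate * advantage


def pg_coefficient(advantage: float, prob: float) -> float:
    # Standard policy gradient: weight by advantage alone, ignoring prob.
    return advantage


if __name__ == "__main__":
    for adv, p in [(1.0, 0.01), (-1.0, 0.01), (1.0, 0.99), (-1.0, 0.99)]:
        print(f"A={adv:+.0f} p={p:.2f} "
              f"PG={pg_coefficient(adv, p):+.3f} DG={dg_coefficient(adv, p):+.3f}")
```

Under these assumptions, a rare success (positive advantage, low probability) keeps almost its full weight, a rare failure is suppressed towards zero, and a likely action is weighted by roughly half its advantage regardless of sign. The plain policy-gradient coefficient is identical in all of these cases up to sign.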

Paper Structure (59 sections, 7 theorems, 22 equations, 28 figures, 2 algorithms)

Introduction
Delightful Policy Gradient
Definitions
Estimator and Implementation
MNIST Diagnostic
Tabular Analysis
Single context: variance reduction
Beyond symmetry.
Multiple contexts: directional improvement
Transformer Sequence Modeling
Continuous Control
Related Work
Conclusion
The Delightful Gate: Derivation and Properties
Entropy-regularized gate selection.
...and 44 more sections

Key Result

Proposition 1

In the bandit above, for any $\eta > 0$: (i) DG preserves the expected gradient direction: $\mathbb{E}[g_{\mathrm{DG}}] = s \cdot g^*_{\mathrm{PG}}$, where $s = (1{-}b)\,w_+ + b\,w_- > 0$. (ii) DG reduces perpendicular variance by a factor of exactly $w_-^2$: $\mathrm{Var}_\perp(g_{\mathrm{DG}}) = w_-^2 \cdot \mathrm{Var}_\perp(g_{\mathrm{PG}})$. DG always reduces the alignment gap; both methods converge to cosine $1$ as $B \to \infty$.

Figures (28)

Figure 1: Effective coefficient $\omega = w \cdot U$ weighting $\nabla_\theta \log \pi$. DG amplifies breakthroughs (rare successes) and suppresses blunders (rare failures); PG (dashed) is probability-blind.
Figure 2: MNIST classification error. Supervised CE requires labels; PG and DG do not.
Figure 3: Classification error at $T = 10\text{k}$ vs. samples per image $S$, faceted by baseline. Dashed line: error from the exact PG-oracle gradient $g^*_{\mathrm{PG}}$, PG's best achievable direction.
Figure 4: Gradient misalignment over training at $S = 1$ (solid) and $S = 100$ (dashed). (a) DG's advantage diminishes with $S$ (variance). (b) DG's advantage persists at $S = 100$ (directional).
Figure 5: Symmetric bandit, normalized steps ($K{=}100$, $B{=}100$, $\alpha{=}0.1$).
...and 23 more figures

Theorems & Definitions (11)

Proposition 1: Variance reduction in symmetric bandits
Lemma 1: Greedy directions under normalized steps
Proposition 2: Directional improvement toward cross-entropy
Proposition 3: Cosine controls progress
proof
Lemma 2: Symmetry identity
proof
Lemma 3: Two-vector cosine monotonicity
proof
Proposition 4: Directional improvement, general $N$
...and 1 more
