Does This Gradient Spark Joy?

Ian Osband

Abstract

Policy gradient computes a backward pass for every sample, even though the backward pass is expensive and most samples carry little learning value. The Delightful Policy Gradient (DG) provides a forward-pass signal of learning value: \emph{delight}, the product of advantage and surprisal (negative log-probability). We introduce the \emph{Kondo gate}, which compares delight against a compute price and pays for a backward pass only when the sample is worth it, thereby tracing a quality--cost Pareto frontier. In bandits, zero-price gating preserves useful gradient signal while removing perpendicular noise, and delight is a more reliable screening signal than additive combinations of value and surprise. On MNIST and transformer token reversal, the Kondo gate skips most backward passes while retaining nearly all of DG's learning quality, with gains that grow as problems get harder and backward passes become more expensive. Because the gate tolerates approximate delight, a cheap forward pass can screen samples before expensive backpropagation, suggesting a speculative-decoding-for-training paradigm.
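
The gating rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes a hard threshold (keep a sample when delight exceeds the price), and the names `delight`, `kondo_gate`, and `price` are hypothetical. Only forward-pass quantities (logits, sampled actions, advantages) are needed to decide which samples would receive a backward pass.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def delight(logits, actions, advantages):
    # Delight = advantage * surprisal, where surprisal = -log pi(action).
    log_probs = log_softmax(logits)[np.arange(len(actions)), actions]
    return advantages * (-log_probs)

def kondo_gate(chi, price):
    # Assumed hard gate: pay for a backward pass only when delight > price.
    return chi > price

# Illustrative batch: two samples, four actions, uniform policy.
logits = np.zeros((2, 4))
actions = np.array([0, 1])
advantages = np.array([1.0, -1.0])

chi = delight(logits, actions, advantages)
keep = kondo_gate(chi, price=0.0)
# Only rows where keep is True would be passed to backpropagation.
```

Raising `price` keeps fewer samples and so buys fewer backward passes; sweeping it is one way to trace the quality-cost trade-off the abstract refers to.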

Paper Structure (47 sections, 4 theorems, 7 equations, 21 figures, 1 algorithm)

Key Result

Proposition 1

Under a $K$-armed bandit with softmax policy $\pi = \mathrm{softmax}(z)$, deterministic reward $R = \mathbb{I}\{A = y^*\}$, and correct-action probability $p = \pi(y^*)$, consider the zero-price hard gate that keeps samples with $\chi > 0$ and skips those with $\chi < 0$:
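
To make this setup concrete, here is a minimal NumPy sketch of the zero-price hard gate in the bandit. It assumes the advantage is computed with baseline $b = p$ (the expected reward), which the statement above does not specify; `K`, `y_star`, and the logit values are illustrative choices. It shows only which samples the gate keeps, not the proposition's conclusion.

```python
import numpy as np

K, y_star = 5, 0                      # illustrative number of arms and correct arm
z = np.zeros(K)
z[y_star] = 1.0                       # illustrative logits

pi = np.exp(z - z.max())
pi /= pi.sum()                        # softmax policy
p = pi[y_star]                        # correct-action probability

rng = np.random.default_rng(0)
A = rng.choice(K, size=1000, p=pi)    # sampled actions
R = (A == y_star).astype(float)       # deterministic reward I{A = y*}

advantage = R - p                     # assumption: baseline equals expected reward p
surprisal = -np.log(pi[A])            # strictly positive since every pi[a] < 1
chi = advantage * surprisal           # delight

keep = chi > 0                        # zero-price hard gate
```

Under the assumed baseline, the advantage is $1 - p > 0$ for the correct action and $-p < 0$ otherwise, so the gate keeps exactly the samples with $A = y^*$ and the expected fraction of backward passes is $p$.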

Figures (21)

  • Figure 1: PG, DG, and Kondo gate (DG-K) at $\rho = 0.03$ on MNIST. (a) The Kondo gate matches DG despite computing 3% of backward passes. (b) It dominates by two orders of magnitude in backward-pass space. Averaged over 30 seeds; shading shows $\pm 1$ standard error.
  • Figure 2: Gate rate sweep ($\rho \in \{0.01, \ldots, 1.0\}$), learning rate tuned per $\rho$. (a) All gate rates converge to $\sim 0.5\%$ error eventually. (b) In backward-step space, smaller $\rho$ reaches any error with orders-of-magnitude fewer backward passes.
  • Figure 3: Compute speedup vs PG to reach 5% test error on MNIST, as a function of the backward/forward cost ratio. DG's advantage is constant ($\sim 2\times$, better learning). DG-K's advantage grows linearly with backward cost (fewer backward passes). At a typical ratio of $4\times$, the Kondo gate is $6\times$ faster than PG.
  • Figure 4: Noise robustness on MNIST. (a) Delight noise scaled relative to $\mathrm{std}(\chi)$: DG tolerates $\sim 50\%$; DG-K degrades earlier. (b) Logit noise: DG is robust until $\sigma_Z \approx 1$; DG-K degrades faster. Both validate that approximate forward passes and approximate delight preserve the gate's value.
  • Figure 5: Priority signal comparison on MNIST. (a) Delight is robust across backward batch sizes; surprisal-only fails. (b) The additive mix collapses for $\alpha > 0.3$; delight (product) is $\alpha$-independent. Validates Proposition 2.
  • ...and 16 more figures

Theorems & Definitions (10)

  • Proposition 1: Kondo gate Pareto improvement
  • Proposition 2: Delight is sign-consistent; additive mixes can mis-rank
  • Proposition 3: Gambling pathology
  • Lemma 1: Softmax gradient geometry
  • proof
  • Remark 1: The arithmetic of noise
  • proof
  • proof
  • proof
  • Remark 2: An environmental limit, not an algorithmic flaw