Does This Gradient Spark Joy?

Ian Osband

Abstract

Policy gradient computes a backward pass for every sample, even though the backward pass is expensive and most samples carry little learning value. The Delightful Policy Gradient (DG) provides a forward-pass signal of learning value: \emph{delight}, the product of advantage and surprisal (negative log-probability). We introduce the \emph{Kondo gate}, which compares delight against a compute price and pays for a backward pass only when the sample is worth it, thereby tracing a quality--cost Pareto frontier. In bandits, zero-price gating preserves useful gradient signal while removing perpendicular noise, and delight is a more reliable screening signal than additive combinations of value and surprise. On MNIST and transformer token reversal, the Kondo gate skips most backward passes while retaining nearly all of DG's learning quality, with gains that grow as problems get harder and backward passes become more expensive. Because the gate tolerates approximate delight, a cheap forward pass can screen samples before expensive backpropagation, suggesting a speculative-decoding-for-training paradigm.
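The screening rule described in the abstract can be illustrated with a minimal sketch. It assumes a hard threshold (delight strictly above the price triggers a backward pass); the function names `delight` and `kondo_gate`, the price value, and the example batch are hypothetical and chosen only for illustration, not taken from the paper.

```python
import math

def delight(advantage, log_prob):
    """Delight = advantage * surprisal, with surprisal = -log pi(a)."""
    return advantage * (-log_prob)

def kondo_gate(advantages, log_probs, price=0.0):
    """Hard-threshold gate (an assumed variant): True means 'pay for a backward pass'."""
    return [delight(a, lp) > price for a, lp in zip(advantages, log_probs)]

# Hypothetical batch: per-sample advantages and action probabilities from a forward pass.
advantages = [0.9, 0.9, -0.5, 0.1]
probs = [0.05, 0.95, 0.30, 0.50]
log_probs = [math.log(p) for p in probs]

chi = [delight(a, lp) for a, lp in zip(advantages, log_probs)]
keep = kondo_gate(advantages, log_probs, price=0.5)

# Fraction of samples that would receive a backward pass under this price.
backward_fraction = sum(keep) / len(keep)
print(chi, keep, backward_fraction)
```

In this toy batch only the first sample (high advantage, low probability) clears the price. The second has the same advantage but was already likely, so its surprisal and hence its delight are small; the third has negative advantage and therefore negative delight.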

Paper Structure (47 sections, 4 theorems, 7 equations, 21 figures, 1 algorithm)

Introduction
The Kondo Gate
Implementation
Why Delight, Not Simpler Priority Signals?
MNIST Diagnostic
Core Results
Compute Efficiency and Approximate Delight
Tabular Analysis
Pareto Improvement and Priority Signal
The Gambling Pathology
Token Reversal
Related Work
Selective backpropagation and curriculum learning.
Prioritized experience replay.
Speculative decoding.
...and 32 more sections

Key Result

Proposition 1

In a $K$-armed bandit with softmax policy $\pi = \mathrm{softmax}(z)$, deterministic reward $R = \mathbb{I}\{A = y^*\}$, and correct-action probability $p = \pi(y^*)$, consider the zero-price hard gate that keeps samples with delight $\chi > 0$ and skips those with $\chi < 0$:
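The setup above can be checked numerically. The sketch below is illustrative only and assumes the advantage is $R - p$, that is, the reward minus the expected reward under $\pi$; this baseline is an assumption and is not specified in the text here. The logits, the helper names `softmax` and `bandit_delight`, and the choice of $y^*$ are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def bandit_delight(z, y_star, action):
    """Delight chi = advantage * surprisal for one sampled action."""
    pi = softmax(z)
    p = pi[y_star]                        # correct-action probability
    reward = 1.0 if action == y_star else 0.0
    advantage = reward - p                # ASSUMPTION: baseline is the expected reward p
    surprisal = -math.log(pi[action])     # negative log-probability of the action
    return advantage * surprisal

z = [0.0, 1.0, -0.5, 0.2]  # hypothetical logits, K = 4
y_star = 1                 # hypothetical correct arm

chis = [bandit_delight(z, y_star, a) for a in range(len(z))]
kept = [a for a, c in enumerate(chis) if c > 0]  # zero-price hard gate
print(chis, kept)
```

Under this assumed baseline, the correct action has advantage $1 - p > 0$ and every incorrect action has advantage $-p < 0$, while surprisal is always positive, so the zero-price gate keeps exactly the samples where $A = y^*$.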

Figures (21)

Figure 1: PG, DG, and Kondo gate (DG-K) at $\rho = 0.03$ on MNIST. (a) The Kondo gate matches DG despite computing only 3% of backward passes. (b) Measured in backward passes, it dominates by two orders of magnitude. Averaged over 30 seeds; shading shows $\pm 1$ standard error.
Figure 2: Gate rate sweep ($\rho \in \{0.01, \ldots, 1.0\}$), learning rate tuned per $\rho$. (a) All gate rates converge to $\sim 0.5\%$ error eventually. (b) In backward-step space, smaller $\rho$ reaches any error with orders-of-magnitude fewer backward passes.
Figure 3: Compute speedup vs PG to reach 5% test error on MNIST, as a function of the backward/forward cost ratio. DG's advantage is constant ($\sim 2\times$, better learning). DG-K's advantage grows linearly with backward cost (fewer backward passes). At a typical ratio of $4\times$, the Kondo gate is $6\times$ faster than PG.
Figure 4: Noise robustness on MNIST. (a) Delight noise scaled relative to $\mathrm{std}(\chi)$: DG tolerates $\sim 50\%$; DG-K degrades earlier. (b) Logit noise: DG is robust until $\sigma_Z \approx 1$; DG-K degrades faster. Both validate that approximate forward passes and approximate delight preserve the gate's value.
Figure 5: Priority signal comparison on MNIST. (a) Delight is robust across backward batch sizes; surprisal-only fails. (b) The additive mix collapses for $\alpha > 0.3$; delight (product) is $\alpha$-independent. Validates Proposition \ref{prop:delight_dominance}.
...and 16 more figures

Theorems & Definitions (10)

Proposition 1: Kondo gate Pareto improvement
Proposition 2: Delight is sign-consistent; additive mixes can mis-rank
Proposition 3: Gambling pathology
Lemma 1: Softmax gradient geometry
proof
Remark 1: The arithmetic of noise
proof
proof
proof
Remark 2: An environmental limit, not an algorithmic flaw
