Variance-Reduced Gradient Estimation via Noise-Reuse in Online Evolution Strategies

Oscar Li, James Harrison, Jascha Sohl-Dickstein, Virginia Smith, Luke Metz

TL;DR

This work tackles gradient estimation for unrolled computation graphs (UCGs) where automatic differentiation (AD) can fail because the dynamics are non-smooth or black-box. It introduces Noise-Reuse Evolution Strategies (NRES), the member of the Generalized Persistent Evolution Strategies (GPES) class that reuses a single Gaussian perturbation across an entire episode, which minimizes gradient estimator variance within that class while remaining online and unbiased. Theoretical variance analysis shows that NRES achieves variance lower than or equal to that of Persistent Evolution Strategies (PES) and full-unroll ES (FullES) under realistic assumptions, and empirical results across dynamical systems, meta-training of learned optimizers, and reinforcement learning demonstrate faster convergence and better wall-clock efficiency. The work highlights online ES as a practical alternative when AD is ineffective, offering substantial speedups and parallelization advantages in challenging UCG settings.

Abstract

Unrolled computation graphs are prevalent throughout machine learning but present challenges to automatic differentiation (AD) gradient estimation methods when their loss functions exhibit extreme local sensitivity, discontinuity, or black-box characteristics. In such scenarios, online evolution strategies methods are a more capable alternative, while being more parallelizable than vanilla evolution strategies (ES) by interleaving partial unrolls and gradient updates. In this work, we propose a general class of unbiased online evolution strategies methods. We analytically and empirically characterize the variance of this class of gradient estimators and identify the one with the least variance, which we term Noise-Reuse Evolution Strategies (NRES). Experimentally, we show NRES results in faster convergence than existing AD and ES methods in terms of wall-clock time and number of unroll steps across a variety of applications, including learning dynamical systems, meta-training learned optimizers, and reinforcement learning.

Paper Structure (71 sections, 12 theorems, 70 equations, 16 figures, 8 tables, 7 algorithms)

Key Result

Lemma 1

An unbiased gradient estimator for the $K$-smoothed loss is given by with randomness in $\boldsymbol{\tau}$ and $\{\boldsymbol{\epsilon}_i\}_{i=1}^{\lceil T/K \rceil}$.

Figures (16)

  • Figure 1: (a) The pathological loss surface in the learned optimizer task (Sec. \ref{exp:lopt}) along a random $\boldsymbol{\epsilon}$ direction; such surfaces are common in UCGs but can make automatic differentiation methods unusable, leading to the recent development of evolution strategies methods. (b) Comparison of properties of different evolution strategies methods. Unlike prior online ES methods, $\mathrm{NRES}$ produces both unbiased and low-variance gradient estimates.
  • Figure 2: (a) Illustration of step-unlocked online ES workers working independently at different truncation windows. Here a central server sends $\theta$ (whose gradient is to be estimated) to each worker and receives the estimates over partial unrolls from each. The averaged gradient can then be used in a first-order optimization algorithm. (b) Comparison of the noise-sharing mechanisms of $\mathrm{PES}$, $\mathrm{GPES}_{K}$, and $\mathrm{NRES}$ (ours). $\mathrm{PES}$ samples new noise in every truncation window and $\mathrm{GPES}_{K\neq T}$ samples new noise in some truncation windows, so both need to accumulate the noise; $\mathrm{NRES}$ samples noise only once at the beginning of an episode and reuses it for the full episode.
  • Figure 3: Total variance of $\mathrm{GPES}_K$ vs. noise-sharing period $K$ for different $\theta_i$'s from the learned trajectory of $\mathrm{PES}$. $\mathrm{GPES}_{K=T}$ ($\mathrm{NRES}$) has the lowest total variance among estimators of its class (including $\mathrm{PES}$) for each $\theta_i$.
  • Figure 4: (a) Comparison of $\mathrm{FullES}$ and $\mathrm{NRES}$ gradient estimation under the same unroll budget. Unlike $\mathrm{FullES}$, which can only use a single noise perturbation $\boldsymbol{\epsilon}$ to unroll sequentially for an entire episode of length $T$, $\mathrm{NRES}$ can use $T/W$ parallel step-unlocked workers, each unrolling inside its own random truncation window of length $W$ with an independent perturbation $\boldsymbol{\epsilon}^{(i)}$. This results in a $T/W\times$ speed-up and variance reduction (Theorem \ref{thm:nres_fulles_comparison}) over $\mathrm{FullES}$. (b) The total variance of $\mathrm{NRES}$ and $\mathrm{FullES}$ estimators under the same compute budget at the same set of $\theta_i$ checkpoints in Figure \ref{fig:variance}(a). $\mathrm{NRES}$ achieves significantly lower total variance.
  • Figure 5: (a) The pathological training loss surface of the Lorenz system problem (left) and the optimization trajectory of different $\mathrm{GPES}_K$ gradient estimators (right). $\mathrm{NRES}$'s trajectory is the smoothest because it has the lowest variance. (b) Different ES methods' loss convergence on the same problem. $\mathrm{NRES}$ converges the fastest.
  • ...and 11 more figures
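
The noise-reuse mechanism described in the Figure 2 and Figure 4 captions can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy linear dynamics, the names `DECAY`, `window_loss_sum`, and `nres_style_estimate`, the antithetic (plus/minus) form, and the $1/(2\sigma^2 W)$ scaling with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I)$ are all assumptions made for illustration. For simplicity the sketch re-unrolls from step 0 for each window, whereas an online worker would keep its states between windows.

```python
import numpy as np

DECAY = 0.9  # hypothetical toy dynamics: s_{t+1} = DECAY * s_t + theta @ x


def window_loss_sum(theta, x, start, length):
    """Unroll the toy system from s_0 = 0 and sum the per-step losses
    L_t = s_t over steps start + 1, ..., start + length."""
    s, total = 0.0, 0.0
    for t in range(1, int(start) + int(length) + 1):
        s = DECAY * s + theta @ x
        if t > start:
            total += s
    return total


def nres_style_estimate(theta, x, eps, tau, W, sigma):
    """Antithetic ES estimate on the truncation window (tau, tau + W].

    The same perturbation eps is applied from step 0 onward, i.e. it is
    sampled once per episode and reused in every window (assumed form)."""
    plus = window_loss_sum(theta + eps, x, tau, W)
    minus = window_loss_sum(theta - eps, x, tau, W)
    return eps * (plus - minus) / (2.0 * sigma**2 * W)


rng = np.random.default_rng(0)
T, W, sigma = 8, 2, 0.1          # episode length, window length, noise scale
theta = np.array([0.3, -0.2, 0.5])
x = np.array([1.0, 2.0, -1.0])   # fixed input of the toy system

# One worker: sample the noise once at the start of the episode ...
eps = sigma * rng.standard_normal(theta.shape)
# ... and evaluate it on a randomly chosen window start in {0, W, ..., T - W}.
tau = W * int(rng.integers(T // W))
estimate = nres_style_estimate(theta, x, eps, tau, W, sigma)
```

In the setting of Figure 4(a), $T/W$ such workers would each hold their own independent `eps` and window, and a server would average their estimates before taking a first-order optimization step.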

Theorems & Definitions (21)

  • Lemma 1
  • Remark 3
  • Lemma 4
  • Theorem 5
  • Corollary 6
  • Remark 7
  • Corollary 8
  • Theorem 9
  • Remark 10
  • Lemma 1
  • ...and 11 more