Randomized Forward Mode of Automatic Differentiation For Optimization Algorithms

Khemraj Shukla, Yeonjong Shin

TL;DR

The paper proposes randomized forward mode gradient (RFG) as a memory-efficient alternative to backpropagation by estimating gradients via directional derivatives along random vectors computed with forward-mode AD. A second-moment analysis shows that the smallest expected relative error is achieved by distributions with minimal kurtosis $\kappa_4$, and, for the quadratic setting, optimal variance is $\sigma^2 = 1/(d+\kappa_4-1)$, yielding a biased gradient estimate in general. The authors develop and analyze RFG-based gradient descent and Polyak's heavy ball methods, proving linear convergence on quadratic objectives, with the best rates attained when using the Bernoulli distribution ($\kappa_4=1$). Extensive computational experiments across quadratic and non-quadratic problems, including SciML tasks, demonstrate that Bernoulli-based RFG often outperforms other distributions and can offer favorable iteration throughput compared with backpropagation, highlighting RFG as a practical gradient-estimation approach for large-scale optimization and scientific machine learning.
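
The variance quoted above can be recovered with a short second-moment calculation. The sketch below is not taken from the paper's proofs. It assumes the estimator has the form $\hat{g} = (g^\top z)\, z$ with $g = \nabla f(x)$, that the components of $z$ are i.i.d. with zero mean and variance $\sigma^2$, and that $\kappa_4$ is the normalized kurtosis, i.e. $\mathbb{E}[z_i^4] = \kappa_4 \sigma^4$. The symbols $\hat{g}$ and $g$ are introduced here for illustration only.

```latex
\begin{aligned}
\mathbb{E}[\hat{g}] &= \sigma^2 g, \\
\mathbb{E}\|\hat{g}\|^2
  &= \sum_{i=1}^{d} \mathbb{E}\big[(g^\top z)^2 z_i^2\big]
   = \sum_{i=1}^{d} \sigma^4 \big[(\kappa_4 - 1)\, g_i^2 + \|g\|^2\big]
   = \sigma^4 (d + \kappa_4 - 1)\, \|g\|^2, \\
\frac{\mathbb{E}\|\hat{g} - g\|^2}{\|g\|^2}
  &= \sigma^4 (d + \kappa_4 - 1) - 2\sigma^2 + 1 .
\end{aligned}
```

Under these assumptions the right-hand side is a quadratic in $\sigma^2$, minimized at $\sigma^2 = 1/(d+\kappa_4-1)$ with minimum value $1 - 1/(d+\kappa_4-1)$. Since $\mathbb{E}[\hat{g}] = \sigma^2 g$, the estimator is biased whenever $\sigma^2 \neq 1$, and the minimum error shrinks as $\kappa_4$ decreases toward its lower bound of 1, which a symmetric two-point distribution attains.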

Abstract

We present a randomized forward mode gradient (RFG) as an alternative to backpropagation. RFG is a random estimator for the gradient that is constructed based on the directional derivative along a random vector. The forward mode automatic differentiation (AD) provides an efficient computation of RFG. The probability distribution of the random vector determines the statistical properties of RFG. Through the second moment analysis, we found that the distribution with the smallest kurtosis yields the smallest expected relative squared error. By replacing gradient with RFG, a class of RFG-based optimization algorithms is obtained. By focusing on gradient descent (GD) and Polyak's heavy ball (PHB) methods, we present a convergence analysis of RFG-based optimization algorithms for quadratic functions. Computational experiments are presented to demonstrate the performance of the proposed algorithms and verify the theoretical findings.
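
The construction described in the abstract can be sketched in JAX as follows. This is a minimal illustration rather than the authors' implementation: the objective `f`, the helper name `rfg`, the starting point, the step size, and the choice $\sigma^2 = 1/d$ with symmetric $\pm\sigma$ entries are all assumptions made for the example.

```python
import jax
import jax.numpy as jnp


def f(x):
    # Illustrative objective (an assumption): a simple quadratic.
    return 0.5 * jnp.sum(x ** 2)


def rfg(fun, x, key, sigma2):
    # Hypothetical helper: random direction with i.i.d. entries equal to
    # +sqrt(sigma2) or -sqrt(sigma2), each with probability 1/2.
    z = jnp.sqrt(sigma2) * jax.random.rademacher(key, x.shape, dtype=x.dtype)
    # A single forward-mode pass returns the directional derivative
    # <grad fun(x), z> without building a backward graph.
    _, directional = jax.jvp(fun, (x,), (z,))
    # Scale the random direction by the directional derivative.
    return directional * z


x0 = jnp.array([1.0, 2.0, 3.0])
d = x0.shape[0]
sigma2 = 1.0 / d  # assumed variance, matching 1/(d + kappa_4 - 1) with kappa_4 = 1
key = jax.random.PRNGKey(0)

# One random gradient estimate at x0.
est = rfg(f, x0, key, sigma2)

# Gradient descent with the true gradient replaced by the estimate.
step_size = 1.0  # illustrative choice, not a value from the paper
x = x0
for _ in range(50):
    key, subkey = jax.random.split(key)
    x = x - step_size * rfg(f, x, subkey, sigma2)
x_final = x
```

With these particular choices each direction has unit length, so every update removes the component of `x` along the sampled direction and the objective cannot increase. Other distributions or step sizes do not share this property in general.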

Paper Structure

This paper contains 20 sections, 8 theorems, 79 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.4

Suppose that $f$ is continuously differentiable. Let $\bm{z}$ be a random vector whose components are i.i.d. from a probability distribution $\text{p}$ whose first and third moments are zero and whose second and fourth moments are finite, denoted by $\sigma^2$, ...

Figures (7)

  • Figure 1: JAX code implementing the JVP of $f(\bm{x})=2\|\bm{x}\|^2$ at $\bm{x}=(0,4,6)$ along $\bm{v}=(1,1,1)$.
  • Figure 1: Top and bottom left: The averaged squared errors versus the number of iterations obtained by the RFG-based GD using the five different probability distributions at varying dimensions $d=5, 10, 20$. Bottom right: The averaged squared errors obtained by the Bernoulli RFG-based GD along with the upper bounds from Theorem \ref{thm:RFG-GD-convg} at varying dimensions $d=5, 10, 20, 30$. The shaded area represents the area that falls within one standard deviation of the mean.
  • Figure 2: JAX code for evaluating the gradient of a quadratic function using VJP.
  • Figure 2: The value map of $\Phi^{10K}(I_{2d})$ on the grid $\Omega$ at $d=30$ for the Bernoulli distribution (left) and the Laplace distribution (middle). Right: The averaged squared errors versus the number of iterations obtained by the RFG-based PHB using the five different probability distributions at $d=30$.
  • Figure 3: The objective function values versus the number of iterations by the RFG algorithms with five different probability distributions. The average of five independent simulations is reported. Left: The Rosenbrock function. Right: The Ackley function.
  • ...and 2 more figures
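
The two code figures are listed by caption only, so the following is a hedged sketch of what such JAX snippets might look like, not a reproduction of the paper's figures. The function, point, and direction come from the first caption. Reusing the same quadratic for the VJP example is an assumption, since the second caption does not say which quadratic it uses.

```python
import jax
import jax.numpy as jnp


def f(x):
    # f(x) = 2 * ||x||^2, as stated in the first code-figure caption.
    return 2.0 * jnp.sum(x ** 2)


x = jnp.array([0.0, 4.0, 6.0])
v = jnp.array([1.0, 1.0, 1.0])

# Forward mode: returns f(x) and the directional derivative <grad f(x), v>.
value, directional = jax.jvp(f, (x,), (v,))

# Reverse mode: pulling back a unit cotangent on the scalar output
# yields the full gradient, which is 4 * x for this function.
out, vjp_fn = jax.vjp(f, x)
(grad,) = vjp_fn(jnp.ones_like(out))

print(value, directional, grad)
```

For this input, `value` is 104, `directional` is 40, and `grad` is `(0, 16, 24)`. The JVP returns one number per direction, while the VJP recovers all three gradient components in one backward pass.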

Theorems & Definitions (25)

  • Remark 2.1
  • Remark 2.2
  • Definition 3.1
  • Remark 3.2
  • Remark 3.3
  • Theorem 3.4
  • Proof 1
  • Theorem 3.5
  • Proof 2
  • Proposition 4.2
  • ...and 15 more