Randomized Forward Mode of Automatic Differentiation For Optimization Algorithms

Khemraj Shukla; Yeonjong Shin

TL;DR

The paper proposes the randomized forward mode gradient (RFG) as a memory-efficient alternative to backpropagation: the gradient is estimated from directional derivatives along random vectors, which forward-mode AD computes efficiently. A second-moment analysis shows that the smallest expected relative squared error is achieved by distributions with minimal kurtosis $\kappa_4$ and that, for the quadratic setting, the optimal variance is $\sigma^2 = 1/(d+\kappa_4-1)$, which makes the gradient estimate biased in general. The authors develop and analyze RFG-based gradient descent and Polyak's heavy ball methods, proving linear convergence on quadratic objectives, with the best rates attained by the Bernoulli distribution ($\kappa_4=1$). Computational experiments on quadratic and non-quadratic problems, including SciML tasks, show that Bernoulli-based RFG often outperforms RFG with other distributions and can offer favorable iteration throughput compared with backpropagation, which positions RFG as a practical gradient-estimation approach for large-scale optimization and scientific machine learning.
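The optimal variance quoted above can be recovered with a short second-moment computation. The sketch below is not taken from the paper: it assumes the estimator has the form $\bm{g} = (\nabla f \cdot \bm{z})\,\bm{z}$, that the components of $\bm{z}$ are i.i.d. with mean zero and variance $\sigma^2$, and that $\kappa_4$ denotes the kurtosis, so the fourth moment is $\kappa_4\sigma^4$.

```latex
\begin{align*}
\mathbb{E}[\bm{g}] &= \mathbb{E}[\bm{z}\bm{z}^\top]\,\nabla f = \sigma^2\,\nabla f, \\
\mathbb{E}\|\bm{g}\|^2 &= \sum_{i=1}^{d} (\partial_i f)^2
   \Big(\mathbb{E}[z_i^4] + \sum_{k \ne i}\mathbb{E}[z_i^2]\,\mathbb{E}[z_k^2]\Big)
   = \sigma^4\,(d + \kappa_4 - 1)\,\|\nabla f\|^2, \\
\frac{\mathbb{E}\|\bm{g} - \nabla f\|^2}{\|\nabla f\|^2}
   &= \sigma^4\,(d + \kappa_4 - 1) - 2\sigma^2 + 1 .
\end{align*}
% The last line is a convex quadratic in \sigma^2, minimized at
%   \sigma^2 = 1 / (d + \kappa_4 - 1),
% where it takes the value 1 - 1 / (d + \kappa_4 - 1).
```

Under these assumptions the minimum decreases as $\kappa_4$ decreases, and $\mathbb{E}[\bm{g}] = \sigma^2 \nabla f$ differs from $\nabla f$ whenever $\sigma^2 \neq 1$, which is consistent with the bias noted in the summary.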

Abstract

We present a randomized forward mode gradient (RFG) as an alternative to backpropagation. RFG is a random estimator for the gradient that is constructed based on the directional derivative along a random vector. The forward mode automatic differentiation (AD) provides an efficient computation of RFG. The probability distribution of the random vector determines the statistical properties of RFG. Through the second moment analysis, we found that the distribution with the smallest kurtosis yields the smallest expected relative squared error. By replacing gradient with RFG, a class of RFG-based optimization algorithms is obtained. By focusing on gradient descent (GD) and Polyak's heavy ball (PHB) methods, we present a convergence analysis of RFG-based optimization algorithms for quadratic functions. Computational experiments are presented to demonstrate the performance of the proposed algorithms and verify the theoretical findings.
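As a rough illustration of the construction described in the abstract, the following JAX sketch forms an RFG estimate from a single `jax.jvp` call and uses it in place of the gradient in plain gradient descent. The names `rfg` and `rfg_gradient_descent`, the test matrix `A`, the step size, and the choice of i.i.d. $\pm 1$ entries for the random vector are assumptions made for this example, not details taken from the paper.

```python
import jax
import jax.numpy as jnp


def rfg(f, x, key):
    """Sketch of an RFG estimate: (directional derivative along z) * z."""
    # Assumption: z has i.i.d. +/-1 entries (a symmetric Bernoulli draw).
    z = jax.random.rademacher(key, x.shape, dtype=x.dtype)
    # jax.jvp returns f(x) and the directional derivative of f at x along z.
    _, dir_deriv = jax.jvp(f, (x,), (z,))
    return dir_deriv * z


def rfg_gradient_descent(f, x0, key, step_size, num_steps):
    """Gradient descent with the gradient replaced by the RFG estimate."""
    x = x0
    for _ in range(num_steps):
        key, subkey = jax.random.split(key)
        x = x - step_size * rfg(f, x, subkey)
    return x


# Hypothetical test problem: f(x) = 0.5 * x^T A x with a diagonal, positive definite A.
A = jnp.diag(jnp.array([1.0, 2.0, 3.0]))


def quadratic(x):
    return 0.5 * x @ A @ x


x0 = jnp.array([1.0, -1.0, 0.5])
x_final = rfg_gradient_descent(
    quadratic, x0, jax.random.PRNGKey(0), step_size=0.05, num_steps=300
)
print(quadratic(x0), quadratic(x_final))
```

No reverse pass is taken and no intermediate activations are stored here; each iteration costs one forward-mode evaluation, which is the memory argument the abstract makes.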

Paper Structure (20 sections, 8 theorems, 79 equations, 7 figures, 2 tables, 1 algorithm)

Introduction
Preliminaries on automatic differentiation
Forward mode AD or Jacobian-vector product
Reverse mode AD or vector-Jacobian product
Forward mode AD-based gradients
RFG-based optimization algorithms
Second-moment analysis of the RFG
Convergence analysis for quadratic functions
RFG-based gradient descent
RFG-based Polyak's heavy ball method
Computational Examples
Quadratic functions
Optimization test problems
Scientific machine learning examples
Computational time comparison: RFG vs Backpropagation
...and 5 more sections

Key Result

Theorem 3.4

Suppose that $f$ is continuously differentiable. Let $\bm{z}$ be a random vector whose components are i.i.d. from a probability distribution $\text{p}$ whose first and third moments are zero and whose second and fourth moments are finite, denoted by $\sigma^2$, ...
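The theorem statement is cut off here, so the following JAX sketch only checks numerically the second-moment behaviour that the summary attributes to the analysis. It assumes the estimator $(\nabla f \cdot \bm{z})\,\bm{z}$ and uses the directional derivative $\nabla f \cdot \bm{z}$ directly in place of a JVP call. The gradient vector `grad`, the sample size, and the two distributions compared are hypothetical choices for illustration.

```python
import jax
import jax.numpy as jnp


def relative_sq_error(z, grad):
    """Per-sample ||(grad . z) z - grad||^2 / ||grad||^2 for a batch of directions z."""
    est = (z @ grad)[:, None] * z  # each row is one RFG sample
    return jnp.sum((est - grad) ** 2, axis=1) / jnp.sum(grad ** 2)


d, n = 10, 20000
grad = jnp.arange(1.0, d + 1.0)  # hypothetical fixed gradient vector
k_bern, k_gauss = jax.random.split(jax.random.PRNGKey(0))

# Symmetric Bernoulli (kurtosis 1) scaled to variance 1 / (d + 1 - 1) = 1 / d.
z_bern = jax.random.rademacher(k_bern, (n, d), dtype=jnp.float32) / jnp.sqrt(d)
# Gaussian (kurtosis 3) scaled to variance 1 / (d + 3 - 1) = 1 / (d + 2).
z_gauss = jax.random.normal(k_gauss, (n, d)) / jnp.sqrt(d + 2.0)

err_bern = float(jnp.mean(relative_sq_error(z_bern, grad)))
err_gauss = float(jnp.mean(relative_sq_error(z_gauss, grad)))

# Empirical mean next to the value 1 - 1 / (d + kappa_4 - 1) expected at the optimal variance.
print(err_bern, 1 - 1 / d)
print(err_gauss, 1 - 1 / (d + 2))
```

With each distribution scaled to its own optimal variance, the lower-kurtosis Bernoulli draw should give the smaller empirical error, in line with the claim that minimal kurtosis is best.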

Figures (7)

Figure 1: JAX code implementing the JVP of $f(\bm{x})=2\|\bm{x}\|^2$ at $\bm{x}=(0,4,6)$ along $\bm{v}=(1,1,1)$.
Figure 1: Top and bottom left: The averaged squared errors versus the number of iterations obtained by the RFG-based GD using the five different probability distributions at varying dimensions $d=5, 10, 20$. Bottom right: The averaged squared errors obtained by the Bernoulli RFG-based GD along with the upper bounds from Theorem \ref{thm:RFG-GD-convg} at varying dimensions $d=5, 10, 20, 30$. The shaded area represents the region within one standard deviation of the mean.
Figure 2: JAX code for evaluating the gradient of a quadratic function using VJP
Figure 2: The value map of $\Phi^{10K}(I_{2d})$ on the grid $\Omega$ at $d=30$ for the Bernoulli distribution (left) and the Laplace distribution (middle). Right: The averaged squared errors versus the number of iterations obtained by the RFG-based PHB using the five different probability distributions at $d=30$.
Figure 3: The objective function values versus the number of iterations by the RFG algorithms with five different probability distributions. The average of five independent simulations is reported. Left: The Rosenbrock function. Right: The Ackley function.
...and 2 more figures
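The code listings named in the first two captions are not reproduced on this page. A minimal sketch of what such calls might look like in JAX is given below, using the function, point, and direction from the JVP caption. The variable names are assumptions, and reusing the same $f$ as the quadratic function for the VJP is a choice made for this example.

```python
import jax
import jax.numpy as jnp


def f(x):
    # f(x) = 2 * ||x||^2, the function named in the JVP caption.
    return 2.0 * jnp.sum(x ** 2)


x = jnp.array([0.0, 4.0, 6.0])
v = jnp.array([1.0, 1.0, 1.0])

# Forward mode: one pass returns f(x) and the directional derivative grad f(x) . v.
value, jvp_out = jax.jvp(f, (x,), (v,))

# Reverse mode: jax.vjp returns f(x) and a pullback function;
# applying the pullback to a unit cotangent yields the full gradient.
value_rev, pullback = jax.vjp(f, x)
(grad,) = pullback(jnp.ones_like(value_rev))

print(value, jvp_out)  # f(x) = 104 and grad f(x) . v = 40
print(grad)            # grad f(x) = 4x = (0, 16, 24)
```

The forward-mode call yields a single scalar per direction, while the reverse-mode pullback yields all $d$ partial derivatives at once. That difference is what the RFG construction works around by sampling random directions.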

Theorems & Definitions (25)

Remark 2.1
Remark 2.2
Definition 3.1
Remark 3.2
Remark 3.3
Theorem 3.4
Proof 1
Theorem 3.5
Proof 2
Proposition 4.2
...and 15 more
