OptEx: Expediting First-Order Optimization with Approximately Parallelized Iterations

Yao Shu; Jiongfeng Fang; Ying Tiffany He; Fei Richard Yu

TL;DR

This paper provides theoretical guarantees for the reliability of the kernelized gradient estimation and the iteration complexity of SGD-based OptEx, confirming that estimation errors diminish to zero as historical gradients accumulate and that SGD-based OptEx enjoys an effective acceleration rate of $\Omega(\sqrt{N})$ over standard SGD given parallelism of $N$.

Abstract

First-order optimization (FOO) algorithms are pivotal in numerous computational domains such as machine learning and signal denoising. However, their application to complex tasks like neural network training often entails significant inefficiencies due to the need for many sequential iterations for convergence. In response, we introduce first-order optimization expedited with approximately parallelized iterations (OptEx), the first framework that enhances the efficiency of FOO by leveraging parallel computing to mitigate its iterative bottleneck. OptEx employs kernelized gradient estimation to make use of gradient history for future gradient prediction, enabling parallelization of iterations, a strategy once considered impractical because of the inherent iterative dependency in FOO. We provide theoretical guarantees for the reliability of our kernelized gradient estimation and the iteration complexity of SGD-based OptEx, confirming that estimation errors diminish to zero as historical gradients accumulate and that SGD-based OptEx enjoys an effective acceleration rate of $\Omega(\sqrt{N})$ over standard SGD given parallelism of $N$. We also use extensive empirical studies, including synthetic functions, reinforcement learning tasks, and neural network training across various datasets, to underscore the substantial efficiency improvements achieved by OptEx.
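The idea of predicting a future gradient from the gradient history can be illustrated with a minimal sketch. It assumes a Gaussian-process view with a separable kernel ${\mathbf{K}} = k\,{\mathbf{I}}$ (the form named in Proposition 4.1), so that all gradient coordinates share one Gram matrix. The RBF kernel, the noise level, the toy objective, and the names `rbf_kernel` and `estimate_gradient` are assumptions made for illustration; they are not taken from the paper, and this is standard GP regression rather than the authors' exact estimator.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Scalar RBF kernel k(a, b) between rows of a (m, d) and b (n, d)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def estimate_gradient(theta, past_thetas, past_grads, noise=1e-6, lengthscale=1.0):
    """GP posterior mean and scalar variance of the gradient at theta.

    Assumption: separable kernel K = k * I, so the d gradient coordinates
    share the same t x t Gram matrix and a single linear solve suffices.
    """
    theta = np.atleast_2d(theta)                               # (1, d)
    gram = rbf_kernel(past_thetas, past_thetas, lengthscale)   # (t, t)
    gram += noise * np.eye(len(past_thetas))                   # observation noise
    k_vec = rbf_kernel(theta, past_thetas, lengthscale)        # (1, t)
    weights = np.linalg.solve(gram, k_vec.T)                   # (t, 1)
    mean = (weights.T @ past_grads).ravel()                    # (d,)
    variance = (rbf_kernel(theta, theta, lengthscale) - k_vec @ weights).item()
    return mean, variance

# Toy history: F(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
past_thetas = np.vstack([np.zeros(3), np.eye(3)])   # four visited points in R^3
past_grads = past_thetas.copy()                     # gradients observed there

query = np.array([0.5, 0.5, 0.0])
mean, variance = estimate_gradient(query, past_thetas, past_grads)
print("estimated gradient:", mean)
print("posterior variance:", variance)
```

In this sketch the posterior variance shrinks near previously visited points and returns to the prior value far from them, which is the qualitative behavior one would expect from an estimator whose error diminishes as gradient history accumulates.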

Paper Structure (37 sections, 15 theorems, 64 equations, 10 figures, 1 table, 1 algorithm)

Introduction
Related Work
Reduction of Iteration Complexity.
Reduction of Time Complexity Per Iteration using Parallel Computing.
Problem Setup
The OptEx Framework
Kernelized Gradient Estimation
Separable Kernel Function.
Local History of Gradients.
Multi-Step Proxy Updates
Approximately Parallelized Iterations
Theoretical Results
Gradient Estimation Analysis
Iteration Complexity Analysis
Experiments
...and 22 more sections

Key Result

Proposition 4.1

Let ${\mathbf{K}}(\cdot,\cdot) = k(\cdot, \cdot)\,{\mathbf{I}}$. Then the posterior mean and covariance in eq:posterior become

Figures (10)

Figure 1: An illustration of OptEx at iteration $t$.
Figure 2: Comparison of the number of sequential iterations $T$ ($x$-axis) required by different methods to achieve the same optimality gap $F({\bm{\theta}}) - \inf_{{\bm{\theta}}} F({\bm{\theta}})$ ($y$-axis) for various synthetic functions. The parallelism $N$ is set to 5 and each curve denotes the mean from 5 independent runs.
Figure 3: Comparison of the cumulative average reward ($y$-axis) achieved by different methods to train DQN on RL tasks under various parameter dimensions $d$ and a varying number of sequential episodes $T$ ($x$-axis). The parallelism $N$ is set to 4 and each curve denotes the mean from 3 independent runs.
Figure 4: Comparison of the test error or training loss ($y$-axis) achieved by different optimizers when training deep neural networks on (a) CIFAR-10 and (b) Shakespeare Corpus with a varying number $T$ of sequential iterations or a varying wallclock time ($x$-axis). The parallelism $N$ is set to 4 and each curve denotes the mean from 5 (for CIFAR-10) or 3 (for Shakespeare Corpus) independent runs. The wallclock time is evaluated on a single NVIDIA RTX 4090 GPU.
Figure 5: An illustrated comparison between OptEx and all the baselines at iteration $t$.
...and 5 more figures

Theorems & Definitions (22)

Proposition 4.1
Theorem 1: Gradient Estimation Error
Corollary 1: Concrete Error Bounds
Theorem 2: Upper Bound
Theorem 3: Lower Bound
Corollary 2: Acceleration Rate
Lemma A.1: Laurent and Massart (2000)
Lemma A.2: Lemma 2 in Appx. B of Chowdhury and Gopalan (2021)
Lemma A.3: Sherman-Morrison formula
Lemma A.4: Non-Increasing Variance Norm
...and 12 more
