Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient

Hao Di; Haishan Ye; Yueling Zhang; Xiangyu Chang; Guang Dai; Ivor W. Tsang

TL;DR

The paper tackles zeroth-order optimization for composite finite-sum objectives by addressing the persistent coordinate-wise variance that arises when only random gradient estimates are available. It introduces Zeroth-order Proximal Double Variance Reduction (ZPDVR), which combines Gaussian smoothing with an averaging-based gradient estimator and a loopless SVRG-style update to reduce both sampling and coordinate-wise variance without needing full first-order information. The authors prove linear convergence with an optimal zeroth-order query complexity of $O(d(n+\kappa)\log(1/\epsilon))$ and show that only $O(1)$ zeroth-order queries are needed per iteration in expectation. Empirical results on binary classification tasks demonstrate fast convergence and superior performance relative to baselines, supporting ZPDVR's practicality for high-dimensional, black-box optimization.

Abstract

Variance reduction techniques are designed to decrease the sampling variance, thereby accelerating the convergence of first-order (FO) and zeroth-order (ZO) optimization methods. However, in composite optimization problems, ZO methods encounter an additional variance called the coordinate-wise variance, which stems from the random gradient estimation. To reduce this variance, prior works require estimating all partial derivatives, essentially approximating FO information. This approach demands $\mathcal{O}(d)$ function evaluations (where $d$ is the dimension), which incurs substantial computational costs and is prohibitive in high-dimensional scenarios. This paper proposes the Zeroth-order Proximal Double Variance Reduction (ZPDVR) method, which utilizes the averaging trick to reduce both sampling and coordinate-wise variances. Compared to prior methods, ZPDVR relies solely on random gradient estimates, calls the stochastic zeroth-order oracle (SZO) $\mathcal{O}(1)$ times per iteration in expectation, and achieves the optimal $\mathcal{O}(d(n + \kappa)\log(1/\epsilon))$ SZO query complexity in the strongly convex and smooth setting, where $\kappa$ represents the condition number and $\epsilon$ is the desired accuracy. Empirical results validate ZPDVR's linear convergence and demonstrate its superior performance over other related methods.
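To make the coordinate-wise variance concrete, here is a minimal sketch of a Gaussian random gradient estimate and of plain averaging over several random directions. It is not the ZPDVR algorithm: the quadratic objective, the forward-difference form of the estimator, the smoothing parameter `mu`, and all function names are assumptions chosen for illustration.

```python
import math
import random

def f(x, a):
    """Hypothetical smooth test objective: 0.5 * sum_j a_j * x_j^2."""
    return 0.5 * sum(aj * xj * xj for aj, xj in zip(a, x))

def zo_gradient(func, x, mu, rng):
    """One random gradient estimate built from two function evaluations.

    Draws u ~ N(0, I_d) and returns ((func(x + mu*u) - func(x)) / mu) * u.
    """
    u = [rng.gauss(0.0, 1.0) for _ in x]
    shifted = [xj + mu * uj for xj, uj in zip(x, u)]
    slope = (func(shifted) - func(x)) / mu
    return [slope * uj for uj in u]

def averaged_zo_gradient(func, x, mu, q, rng):
    """Average q independent estimates; the variance shrinks roughly as 1/q."""
    total = [0.0] * len(x)
    for _ in range(q):
        g = zo_gradient(func, x, mu, rng)
        total = [t + gj for t, gj in zip(total, g)]
    return [t / q for t in total]

def l2_error(g, g_true):
    """Euclidean distance between an estimate and the true gradient."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(g, g_true)))

rng = random.Random(0)
a = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, -1.0, 0.5, 0.25, -0.5]
objective = lambda z: f(z, a)
true_grad = [aj * xj for aj, xj in zip(a, x)]  # exact gradient of the quadratic

# Mean error of a single-direction estimate, taken over 200 independent draws.
single_errors = [
    l2_error(zo_gradient(objective, x, 1e-4, rng), true_grad) for _ in range(200)
]
err_single = sum(single_errors) / len(single_errors)

# Error after averaging 20000 directions at the same point.
g_avg = averaged_zo_gradient(objective, x, mu=1e-4, q=20000, rng=rng)
err_avg = l2_error(g_avg, true_grad)

print("mean single-direction error:", err_single)
print("averaged-estimate error:", err_avg)
```

A single direction gives a noisy estimate even though the objective here is deterministic, which is the coordinate-wise variance described above. Averaging many directions at one point removes it, but at a cost that grows with the number of directions, which is the expense the abstract says ZPDVR avoids.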

Paper Structure (11 sections, 16 theorems, 73 equations, 1 figure, 2 tables, 1 algorithm)

Introduction
Related Work
Methodology
Gradient Estimate in Zeroth-Order Optimization
Coordinate-Wise Variance Reduction in Zeroth-Order Optimization
Convergence Analysis
Experiment
Conclusion
Some useful Lemmas
Missing Proofs
Hyperparameter Tuning

Key Result

Lemma 3.1

Let the random vector $u$ be drawn from the multivariate Gaussian distribution $\mathcal{N}(0, I_d)$. For the $L$-smooth function $f_i$ and any $x \in \mathbb{R}^d$, $i \in [n]$, the stochastic directional derivative estimator satisfies […], and its expectation w.r.t. $u$ is […], where $s_i(x, u)$ is a function of $x$ and $u$ taking values in $[0, 1]$, and $\tau_i(x, u)$ is the error term with […].
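As a rough guide to why an error term of this kind appears, the standard Taylor argument for an $L$-smooth $f_i$ can be written out. The forward-difference form of the estimator, the smoothing parameter $\mu$, and the symbol $e_i$ are assumptions made for this sketch; the bound is the generic one and not the lemma's exact statement.

$$
\frac{f_i(x+\mu u)-f_i(x)}{\mu}\,u \;=\; u u^\top \nabla f_i(x) \;+\; e_i(x,u)\,u,
\qquad |e_i(x,u)| \le \frac{L\mu}{2}\,\|u\|^2 ,
$$

$$
\mathbb{E}_u\!\left[u u^\top \nabla f_i(x)\right] \;=\; \nabla f_i(x)
\quad\text{since}\quad \mathbb{E}_u\!\left[u u^\top\right] = I_d .
$$

Under these assumptions the estimate is unbiased up to a term of order $\mu$, yet a single draw of $u u^\top \nabla f_i(x)$ still fluctuates around $\nabla f_i(x)$. That fluctuation is the coordinate-wise variance.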

Figures (1)

Figure 1: Comparison of different zeroth-order methods in terms of the loss residual $F(x) - F(x^*)$ versus the number of SZO queries. The $y$-axis is on a logarithmic scale, and the $x$-axis shows the number of SZO queries divided by $n \cdot d$.

Theorems & Definitions (30)

Lemma 3.1
Corollary 3.2
Remark 3.3
Remark 3.4
Lemma 3.5
Remark 3.6
Lemma 4.1
Lemma 4.2
Lemma 4.3
Lemma 4.4
...and 20 more
