Derivative-Free Optimization via Finite Difference Approximation: An Experimental Study

Wang Du-Yi, Liang Guo, Liu Guangwu, Zhang Kun

TL;DR

This work analyzes a key trade-off in derivative-free optimization: gradient estimation accuracy versus iteration count. It compares KW and SPSA, which use two function evaluations per iteration with diminishing step sizes, to batch-based Cor-CFD, which uses multiple samples to obtain higher-quality FD gradients and employs Armijo line search for adaptive steps. Across low- and high-dimensional synthetic problems and a real hyperparameter-tuning task, batch-based Cor-CFD-GD demonstrates faster convergence, better stability, and lower variance in many settings, illustrating the practical gains of investing in gradient estimation accuracy. The study suggests that batch-based FD approaches can outperform classical minimal-sample DFO methods in noisy black-box environments, with implications for scalable optimization in ML and engineering.
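As a rough illustration of the adaptive step-size idea mentioned above, the following is a minimal sketch of textbook Armijo backtracking for a gradient descent step. It assumes a deterministic objective `f`; how the paper evaluates the sufficient-decrease condition under noisy samples is not stated in this summary, so the function name `armijo_step` and the parameters `alpha0`, `beta`, `c1`, and `max_backtracks` are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def armijo_step(f, x, grad, alpha0=1.0, beta=0.5, c1=1e-4, max_backtracks=30):
    """Backtracking line search along -grad (hypothetical helper).

    Shrinks the step alpha by a factor beta until the Armijo
    sufficient-decrease condition
        f(x - alpha * grad) <= f(x) - c1 * alpha * ||grad||^2
    holds, or until max_backtracks reductions have been tried.
    """
    fx = f(x)
    grad_sq = float(np.dot(grad, grad))
    alpha = alpha0
    for _ in range(max_backtracks):
        if f(x - alpha * grad) <= fx - c1 * alpha * grad_sq:
            return alpha
        alpha *= beta
    return alpha  # smallest step tried; the condition may not hold

# Usage on mu(x) = x^4 (assumed noise-free here for clarity).
f = lambda x: float(np.sum(x ** 4))
x = np.array([2.0])
grad = 4 * x ** 3
alpha = armijo_step(f, x, grad)
x_next = x - alpha * grad
```

In contrast to the pre-specified diminishing step sizes used by KW and SPSA, the step here is chosen from the observed decrease, which is only sensible when the gradient estimate is reasonably accurate.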

Abstract

Derivative-free optimization (DFO) is vital in solving complex optimization problems where only noisy function evaluations are available through an oracle. Within this domain, DFO via finite difference (FD) approximation has emerged as a powerful method. Two classical approaches are the Kiefer-Wolfowitz (KW) and simultaneous perturbation stochastic approximation (SPSA) algorithms, which estimate gradients using just two samples in each iteration to conserve samples. However, this approach yields imprecise gradient estimators, necessitating diminishing step sizes to ensure convergence, often resulting in slow optimization progress. In contrast, FD estimators constructed from batch samples approximate gradients more accurately. While gradient descent algorithms using batch-based FD estimators achieve more precise results in each iteration, they require more samples and permit fewer iterations. This raises a fundamental question: which approach is more effective -- KW-style methods or DFO with batch-based FD estimators? This paper conducts a comprehensive experimental comparison among these approaches, examining the fundamental trade-off between gradient estimation accuracy and iteration steps. Through extensive experiments in both low-dimensional and high-dimensional settings, we demonstrate a surprising finding: when an efficient batch-based FD estimator is applied, its corresponding gradient descent algorithm generally shows better performance compared to classical KW and SPSA algorithms in our tested scenarios.
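To make the trade-off concrete, the sketch below contrasts a two-evaluation SPSA gradient estimate with a plain batch central finite difference (CFD) estimate that averages several sample pairs per coordinate. It is not the Cor-CFD estimator studied in the paper, whose construction is not reproduced in this summary. The oracle `noisy_oracle`, the test function $\mu(x) = \sum_i x_i^4$, the Gaussian noise model, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def noisy_oracle(x, rng, sigma=0.1):
    """Hypothetical oracle: mu(x) = sum(x_i^4) plus Gaussian noise."""
    return float(np.sum(x ** 4)) + sigma * rng.standard_normal()

def spsa_gradient(x, c, rng, sigma=0.1):
    """SPSA estimate from two evaluations with a Rademacher perturbation."""
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    diff = (noisy_oracle(x + c * delta, rng, sigma)
            - noisy_oracle(x - c * delta, rng, sigma))
    return diff / (2.0 * c * delta)

def batch_cfd_gradient(x, c, n_pairs, rng, sigma=0.1):
    """Plain batch CFD: average n_pairs central differences per coordinate.

    Costs 2 * n_pairs * x.size oracle calls, versus 2 for SPSA.
    """
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = 1.0
        diffs = [noisy_oracle(x + c * e, rng, sigma)
                 - noisy_oracle(x - c * e, rng, sigma)
                 for _ in range(n_pairs)]
        grad[i] = np.mean(diffs) / (2.0 * c)
    return grad

# Compare squared error against the true gradient 4 * x^3 at one point.
rng = np.random.default_rng(0)
x = np.array([1.0, 1.0])
true_grad = 4 * x ** 3
spsa_err = np.mean([np.sum((spsa_gradient(x, 0.1, rng) - true_grad) ** 2)
                    for _ in range(200)])
cfd_err = np.mean([np.sum((batch_cfd_gradient(x, 0.1, 50, rng) - true_grad) ** 2)
                   for _ in range(200)])
```

The batch estimate is far more accurate per call but spends many more oracle evaluations, so under a fixed sample budget it permits fewer iterations. That is the trade-off the experiments examine.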

Paper Structure

This paper contains 10 sections, 1 theorem, 17 equations, 7 figures, 4 tables.

Key Result

Theorem 2.1

Assume that $\mu(x)$ is five times differentiable at $x_k$ with non-zero fifth derivative, and that $\mathrm{Var}[\epsilon(x)] > 0$ is continuous at $x_k$. For any $r = 1, \dots, R$ $(R \geq 2)$, let $c_{k,r} = t_{k,r} n_k^{-1/10}$ $(c_{k,r} \neq 0)$, with $c_{k,s} \neq c_{k,r}$ for any $s \neq r$. If $n_k \to \infty$ [...] where $D = \mu^{(5)}(x_k)/120$, $\boldsymbol{t}_k = [|t_{k,1}|, \dots, |t_{k,R}|]^{\top}$, [...]

Figures (7)

  • Figure 1: Solution value comparison between Cor-CFD-GD and KW algorithms across sample pairs for $\mu(x) = x^4$ with noise level $\sigma=0.1$.
  • Figure 2: Solution value comparison between Cor-CFD-GD and KW algorithms across sample pairs for $\mu(x) = x^4$ with noise level $\sigma = 10$.
  • Figure 3: Solution value comparison between Cor-CFD-GD and KW algorithms across sample pairs for $\mu(x) = -100\cos (\pi x / 100)$.
  • Figure 4: Image of function \ref{function213} at $d=2$.
  • Figure 5: Solution gap comparison between Cor-CFD-GD and SPSA algorithms across sample pairs for function \ref{function213}.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Theorem 2.1: Theorem 4 in Liang2024efficient