Optimistic Gradient Learning with Hessian Corrections for High-Dimensional Black-Box Optimization

Yedidya Kfir, Elad Sarafian, Sarit Kraus, Yoram Louzoun

TL;DR

This work tackles high-dimensional black-box optimization by introducing Optimistic Hessian Gradient Learning (OHGL), which unifies Evolutionary Gradient Learning (EvoGrad) and Higher-Order Gradient Learning (HGrad). EvoGrad biases gradient estimates toward globally favorable regions using CMA-ES-derived weights, while HGrad injects Hessian information to improve gradient accuracy, yielding faster convergence with provably controllable accuracy. The EvoGrad2 variant combines both ideas, using Monte Carlo pairwise sampling and detached Jacobians to scale learning to large problems. It achieves state-of-the-art results on the COCO suite and is also applied to adversarial training and code generation. The results show robust performance across dimensions, budgets, and accuracy requirements, highlighting EvoGrad2 as a practical tool for high-dimensional, non-linear black-box optimization in ML research and real-world tasks.
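One way to picture the weighting idea above is a weighted least-squares fit of a local gradient in which samples with lower function values count more. The sketch below is an illustration under stated assumptions, not the paper's estimator: the helper names `log_rank_weights` and `weighted_gradient_fit` are hypothetical, the weights borrow the log-rank form of CMA-ES recombination weights but are applied to every sample, and a per-point linear fit stands in for a learned gradient model.

```python
import numpy as np

def log_rank_weights(values):
    """Assumed weighting: CMA-ES-style log-rank weights, largest for the lowest value."""
    values = np.asarray(values, dtype=float)
    n = values.size
    ranks = np.argsort(np.argsort(values)) + 1  # rank 1 = lowest function value
    w = np.log(n + 0.5) - np.log(ranks)         # strictly positive for ranks 1..n
    return w / w.sum()

def weighted_gradient_fit(f, x, eps, n_samples=128, seed=0):
    """Hypothetical helper: weighted first-order Taylor fit of grad f at x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Perturbations drawn from a box of half-width eps around x (an assumption).
    tau = eps * rng.uniform(-1.0, 1.0, size=(n_samples, x.size))
    fx = f(x)
    fy = np.array([f(x + t) for t in tau])
    w = log_rank_weights(fy)
    sw = np.sqrt(w)[:, None]
    # Minimize sum_i w_i * (f(x + tau_i) - f(x) - tau_i . g)^2 over g.
    g, *_ = np.linalg.lstsq(sw * tau, sw[:, 0] * (fy - fx), rcond=None)
    return g
```

For a linear function any positive weighting recovers the same gradient. The weights only change the estimate when curvature makes the sampled residuals disagree, in which case the fit leans toward the lower-valued samples.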

Abstract

Black-box algorithms are designed to optimize functions without relying on their underlying analytical structure or gradient information, making them essential when gradients are inaccessible or difficult to compute. Traditional methods for solving black-box optimization (BBO) problems predominantly rely on non-parametric models and struggle to scale to large input spaces. Conversely, parametric methods that model the function with neural estimators and obtain gradient signals via backpropagation may suffer from significant gradient errors. A recent alternative, Explicit Gradient Learning (EGL), which directly learns the gradient using a first-order Taylor approximation, has demonstrated superior performance over both parametric and non-parametric methods. In this work, we propose two novel gradient learning variants to address the robustness challenges posed by high-dimensional, complex, and highly non-linear problems. Optimistic Gradient Learning (OGL) introduces a bias toward lower regions in the function landscape, while Higher-order Gradient Learning (HGL) incorporates second-order Taylor corrections to improve gradient accuracy. We combine these approaches into the unified OHGL algorithm, achieving state-of-the-art (SOTA) performance on the synthetic COCO suite. Additionally, we demonstrate OHGL's applicability to high-dimensional real-world machine learning (ML) tasks such as adversarial training and code generation. Our results highlight OHGL's ability to generate stronger candidates, offering a valuable tool for ML researchers and practitioners tackling high-dimensional, non-linear optimization challenges.
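The difference between a first-order and a second-order Taylor fit can be seen in a small numerical sketch. It is a simplification, not the paper's method: instead of training a neural gradient estimator, it solves a least-squares problem at a single point, and the names `taylor_gradient_fit` and `quadratic` are hypothetical. The first-order fit explains $f(x+\tau) - f(x)$ with $\tau^\top g$ alone, while the second-order fit adds quadratic terms in $\tau$ so that curvature no longer leaks into $g$.

```python
import numpy as np

def taylor_gradient_fit(f, x, eps, second_order=False, n_samples=256, seed=0):
    """Hypothetical helper: least-squares gradient of f at x from samples in an eps-ball."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    d = x.size
    # Draw perturbations uniformly from the ball of radius eps.
    direction = rng.normal(size=(n_samples, d))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    radius = eps * rng.uniform(size=(n_samples, 1)) ** (1.0 / d)
    tau = radius * direction
    df = np.array([f(x + t) for t in tau]) - f(x)
    features = tau
    if second_order:
        # Add tau_i * tau_j (i <= j) so curvature is absorbed by these columns
        # instead of contaminating the gradient coefficients.
        i, j = np.triu_indices(d)
        features = np.hstack([tau, tau[:, i] * tau[:, j]])
    coef, *_ = np.linalg.lstsq(features, df, rcond=None)
    return coef[:d]  # the first d coefficients are the gradient estimate

def quadratic(z):
    # Toy objective chosen for illustration; its gradient is known in closed form.
    return z[0] ** 2 + 3.0 * z[0] * z[1] + 2.0 * z[1] ** 2

x0 = np.array([1.0, -2.0])
true_grad = np.array([2 * x0[0] + 3 * x0[1], 3 * x0[0] + 4 * x0[1]])
err_first = np.linalg.norm(taylor_gradient_fit(quadratic, x0, eps=0.5) - true_grad)
err_second = np.linalg.norm(
    taylor_gradient_fit(quadratic, x0, eps=0.5, second_order=True) - true_grad
)
```

On this toy quadratic the second-order fit is exact up to floating-point error, whereas the first-order fit carries a curvature-induced error that grows with `eps`. Comparing `err_first` and `err_second` for a few values of `eps` makes the trade-off visible.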

Paper Structure

This paper contains 29 sections, 7 theorems, 62 equations, 15 figures, 5 tables, 2 algorithms.

Key Result

Theorem 3.1

(Evolutionary Gradient Controllable Accuracy) For any differentiable function $f$ with a continuous gradient, there exists $\kappa^{EvoGrad} > 0$ such that for any $\varepsilon > 0$, $g^{EvoGrad}_{\varepsilon}(x)$ satisfies

Figures (15)

  • Figure 1: EvoGrad vs EGL trajectories. 1st row: Gallagher’s Gaussian 101-me. 2nd row: 21-hi.
  • Figure 2: The probability that each algorithm finds the correct direction to the global minimum from a randomly selected starting point, as a function of the size of $\varepsilon$.
  • Figure 3: The probability that each algorithm solves a problem, by dimension.
  • Figure 4: Experiment results against the baselines: (a) convergence of all our algorithms against baseline algorithms, (b) success rate as a function of the normalized distance from the best-known solution, (c) percentage of solved problems when the distance from the best point is 0.01.
  • Figure 5: Original
  • ...and 10 more figures

Theorems & Definitions (9)

  • Theorem 3.1
  • Theorem 4.1
  • Definition J.1
  • Theorem J.2
  • Definition J.3
  • Theorem J.4
  • Corollary J.5
  • Corollary J.6
  • Corollary J.7