Achieving ${O}(ε^{-1.5})$ Complexity in Hessian/Jacobian-free Stochastic Bilevel Optimization

Yifan Yang; Peiyao Xiao; Kaiyi Ji

TL;DR

This work tackles stochastic bilevel optimization with a nonconvex upper level and a strongly convex lower level, aiming to achieve $O(\epsilon^{-1.5})$ sample complexity using only first-order information. It introduces FdeHBO, a Hessian/Jacobian-free, fully single-loop optimizer that employs a projection-aided finite-difference scheme to approximate Hessian/Jacobian actions and momentum-based updates for $y$, $v$, and $x$. Theoretical guarantees show $\mathbb{E}\|\nabla \Phi(x)\|^2$ decays at a rate $\tilde O(1/T^{2/3})$ with $\tilde O(\epsilon^{-1.5})$ samples needed to reach an $\epsilon$-accurate stationary point, representing the first such result without second-order computations. A small-dimension variant, FMBO, preserves the same complexity with a simpler per-iteration Hessian-vector computation. Experiments on MNIST hyper-representation and hyper-cleaning corroborate the theory, demonstrating faster convergence and competitive accuracy against state-of-the-art Hessian/Jacobian-free and fully first-order methods.
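
The link between the stated rate and the stated sample complexity can be checked with a short calculation. This is only a sketch: it assumes that $\epsilon$-accuracy is measured on the squared gradient norm, as the rates quoted above suggest, and it ignores the logarithmic factors hidden in $\tilde O$.

$$
\mathbb{E}\|\nabla \Phi(x)\|^2 \le \frac{C}{T^{2/3}} \le \epsilon
\quad\Longleftrightarrow\quad
T \ge \left(\frac{C}{\epsilon}\right)^{3/2} = O(\epsilon^{-1.5}),
$$

where $C$ is a placeholder constant introduced here for illustration. With $O(1)$ samples per iteration, the iteration count and the sample count are of the same order.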

Abstract

In this paper, we revisit the bilevel optimization problem, in which the upper-level objective function is generally nonconvex and the lower-level objective function is strongly convex. Although this type of problem has been studied extensively, it still remains an open question how to achieve an ${O}(ε^{-1.5})$ sample complexity in Hessian/Jacobian-free stochastic bilevel optimization without any second-order derivative computation. To fill this gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named FdeHBO, which features a simple fully single-loop structure, a projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates. Theoretically, we show that FdeHBO requires ${O}(ε^{-1.5})$ iterations (each using ${O}(1)$ samples and only first-order gradient information) to find an $ε$-accurate stationary point. As far as we know, this is the first Hessian/Jacobian-free method with an ${O}(ε^{-1.5})$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.
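
To make the idea of a projection-aided finite-difference Hessian-vector approximation concrete, here is a minimal sketch. It is not the paper's exact estimator, which is not reproduced in this summary. It assumes a central difference of two lower-level gradient evaluations and a Euclidean-ball projection of the auxiliary vector $v$. The function names, the step size `delta`, and the quadratic test problem are all hypothetical.

```python
import numpy as np

def project_to_ball(v, radius):
    """Project v onto the Euclidean ball of the given radius (assumed projection)."""
    norm = np.linalg.norm(v)
    if norm <= radius:
        return v
    return v * (radius / norm)

def fd_hessian_vector(grad_fn, y, v, delta=1e-4):
    """Central-difference estimate of (Hessian at y) @ v using only two gradient calls."""
    return (grad_fn(y + delta * v) - grad_fn(y - delta * v)) / (2.0 * delta)

# Hypothetical strongly convex quadratic: g(y) = 0.5 * y^T A y - b^T y
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad_g(y):
    # Gradient of the quadratic above; its Hessian is the constant matrix A.
    return A @ y - b

y = np.array([0.5, -0.25])
# Keep v bounded before it enters the finite difference.
v = project_to_ball(np.array([3.0, 4.0]), radius=1.0)

hv_estimate = fd_hessian_vector(grad_g, y, v)
hv_exact = A @ v  # available here only because the test problem is quadratic
print(hv_estimate, hv_exact)
```

Bounding $v$ by projection keeps the perturbation `delta * v` small, which is what lets a finite difference of gradients stand in for a second-order product. For a quadratic the central difference is exact up to floating-point error; for general smooth functions the error depends on `delta` and on the norm of $v$.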

Paper Structure (36 sections, 26 theorems, 157 equations, 2 figures, 1 table, 2 algorithms)

Introduction
Our Contributions
Related Work
Algorithms
Hypergradient Computation
Hessian/Jacobian-free Bilevel Optimizer via Projection-aided Finite-difference Estimation
Extension to Small-Dimensional Case
Main Results
Assumptions and Definitions
Convergence and Complexity Analysis of FdeHBO
Convergence and Complexity Analysis of FMBO
Experiments
Hyper-representation on MNIST Dataset
Hyper-cleaning on MNIST Dataset
Conclusion
...and 21 more sections

Key Result

Proposition 1

Under Assumption as:sf, the iterates of the outer problem by alg:main_free satisfy for all $t \in \{0, \ldots, T-1\}$ with $L_F^2 = 2(L_{f_x}^2 + L_{g_{xy}}^2 r_v^2)$.

Figures (2)

Figure 1: Comparison on hyper-representation with the LeNet neural network. Left plot: outer loss vs. running time; right plot: accuracy vs. running time.
Figure 2: (a) Comparison of different algorithms on data hyper-cleaning with noise $p=0.1$. Left plot: test loss vs. running time; right plot: train loss vs. running time. (b) Comparison among different single-loop algorithms: training loss vs. running time.

Theorems & Definitions (46)

Definition 1
Proposition 1
Proposition 2
Proposition 3
Theorem 1
Corollary 1
Theorem 2
Corollary 2
Lemma 1: Boundedness of $v^*$
proof
...and 36 more
