Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization

Feihu Huang

TL;DR

We address nonconvex-PL bilevel optimization, where the upper-level problem is possibly nonconvex and the lower-level problem is nonconvex but satisfies the Polyak-Łojasiewicz (PL) condition. The proposed method, HJFBiO, is Hessian/Jacobian-free: it relies on finite-difference estimators and a projection-based hypergradient surrogate, and achieves a convergence rate of $O\left(\frac{1}{T}\right)$ and a gradient complexity of $O(\varepsilon^{-1})$ for finding an $\varepsilon$-stationary point. The framework supports both global and local PL lower-level structures, has an $O(p+d)$ per-iteration cost, and is proven optimal in gradient complexity. It is validated on bilevel PL games and hyper-representation learning tasks. Because the method never forms Hessians or their inverses, it scales to practical ML settings.
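To illustrate what "Hessian/Jacobian-free" can mean in practice, the following is a minimal sketch of a generic central-difference estimator for a Hessian-vector product that uses only gradient evaluations. It is not taken from the paper and may differ from the estimator HJFBiO actually uses; the names `fd_hvp`, `grad_g`, and `delta` are hypothetical and chosen for illustration only.

```python
import numpy as np

def fd_hvp(grad, y, v, delta=1e-5):
    """Approximate (Hessian at y) @ v from two gradient calls.

    Generic central-difference estimator (an assumption for illustration,
    not necessarily the paper's estimator). No Hessian matrix is formed.
    """
    return (grad(y + delta * v) - grad(y - delta * v)) / (2.0 * delta)

# Toy check on a quadratic g(y) = 0.5 * y^T A y, whose Hessian is A.
A = np.array([[2.0, 1.0], [1.0, 3.0]])

def grad_g(y):
    # Gradient of the toy quadratic (A is symmetric).
    return A @ y

y = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])

approx = fd_hvp(grad_g, y, v)  # gradient-only estimate
exact = A @ v                  # reference value using the explicit Hessian
```

For a quadratic the central difference is exact up to rounding error, so `approx` and `exact` agree closely. The estimator stores only vectors of the same size as `y`, which is the kind of cost that avoids forming or inverting a Hessian.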

Abstract

Bilevel optimization is widely applied in many machine learning tasks such as hyper-parameter learning, meta learning and reinforcement learning. Although many algorithms have recently been developed to solve bilevel optimization problems, they generally rely on (strongly) convex lower-level problems. More recently, some methods have been proposed to solve nonconvex-PL bilevel optimization problems, where the upper-level problems are possibly nonconvex, and the lower-level problems are also possibly nonconvex while satisfying the Polyak-Łojasiewicz (PL) condition. However, these methods still have a high convergence complexity or a high computation complexity, such as requiring the computation of expensive Hessian/Jacobian matrices and their inverses. In this paper, we therefore propose an efficient Hessian/Jacobian-free method (i.e., HJFBiO) with the optimal convergence complexity to solve nonconvex-PL bilevel problems. Theoretically, under some mild conditions, we prove that our HJFBiO method obtains an optimal convergence rate of $O(\frac{1}{T})$, where $T$ denotes the number of iterations, and has an optimal gradient complexity of $O(\varepsilon^{-1})$ in finding an $\varepsilon$-stationary solution. We conduct numerical experiments on the bilevel PL game and the hyper-representation learning task to demonstrate the efficiency of our proposed method.
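For orientation, the problem class described above can be written in the standard form below. This is a sketch in assumed notation, not a quotation from the paper: the dimensions $x\in\mathbb{R}^d$ and $y\in\mathbb{R}^p$ are inferred from the lemma statement and the $O(p+d)$ cost quoted on this page, $f$ and $g$ denote the upper- and lower-level objectives, and $\mu>0$ is the PL constant.

```latex
% Bilevel problem (assumed notation):
\min_{x \in \mathbb{R}^d} \; F(x) = f\bigl(x, y^*(x)\bigr),
\qquad
y^*(x) \in \arg\min_{y \in \mathbb{R}^p} g(x, y).

% PL condition on the lower-level objective, for some \mu > 0 and all (x, y):
\bigl\| \nabla_y g(x, y) \bigr\|^2 \;\ge\; 2\mu \Bigl( g(x, y) - \min_{y' \in \mathbb{R}^p} g(x, y') \Bigr).
```

The PL inequality does not require $g(x,\cdot)$ to be convex, and the set of lower-level minimizers need not be a single point, which is why $y^*(x)$ is written with $\in$ rather than $=$.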

Paper Structure

This paper contains 16 sections, 17 theorems, 109 equations, 3 figures, 1 table, 1 algorithm.

Introduction
Preliminaries
Mild Assumptions
Useful Lemmas
Efficient Hessian/Jacobian-Free Bilevel Optimization Method
Convergence Analysis
Convergence Properties of Our Algorithm on Unimodal $g(x,y)$
Convergence Properties of Our Algorithm on Multimodal $g(x,y)$
Experiments
Bilevel Polyak-Łojasiewicz Game
Hyper-Representation Learning
Conclusions
Detailed Convergence Analysis
Convergence Analysis of HJFBiO Algorithm for Bilevel Optimization with Regularization
Convergence Analysis of HJFBiO Algorithm for Bilevel Optimization without Regularization
...and 1 more sections

Key Result

Lemma 2.6

(huang2023momentum) Under the above Assumption ass:2, we have, for any $x\in \mathbb{R}^d$,

Figures (3)

Figure 1: PL Game: norm of gradient vs. number of iterations for $d=100$ (Left) and $d=200$ (Right).
Figure 2: Distances of the algorithms for $d=100$ (Left) and $d=200$ (Right).
Figure 3: Losses of the algorithms for $d=100$ (Left) and $d=200$ (Right).

Theorems & Definitions (27)

Lemma 2.6
Lemma 2.7
Lemma 2.8
Definition 3.1
Lemma 4.1
Lemma 4.2
Lemma 4.3
Theorem 4.4
Remark 4.5
Definition 4.6
...and 17 more
