Unlocking Global Optimality in Bilevel Optimization: A Pilot Study

Quan Xiao; Tianyi Chen

TL;DR

This work tackles global optimality in bilevel optimization by penalizing the lower-level optimality gap, which gives the surrogate $\mathsf{L}_\gamma(u,v)=f(u,v)+\gamma\,(g(u,v)-g^*(u))$ with $g^*(u)=\min_v g(u,v)$, and by analyzing a first-order method for it, penalty-based bilevel gradient descent (PBGD). It introduces two benign landscape conditions for $\mathsf{L}_\gamma$, joint PL and blockwise PL (PL: Polyak-Łojasiewicz), and proves that PBGD with Jacobi or Gauss-Seidel updates converges to the global bilevel optimum: for a target accuracy $\epsilon$, the penalty parameter is set to $\gamma=\mathcal{O}(\epsilon^{-0.5})$ and the iteration count is stated as $\mathcal{O}(\log(1/\epsilon)^2)$. The framework is validated on two representative bilevel learning problems, representation learning and data hyper-cleaning, where the penalized landscape is shown to be benign and numerical experiments confirm that the algorithms reach globally optimal solutions. Collectively, the results offer a principled path to global optimality in structured bilevel problems and have implications for the reliability and safety of AI systems relying on bilevel formulations.
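For readers unfamiliar with the two landscape conditions named above, the display below writes out the standard PL inequality applied to $\mathsf{L}_\gamma$, first jointly in $(u,v)$ and then block by block. The constants $\mu$, $\mu_u$, $\mu_v$ are notation assumed here for illustration; the paper's Definition 1 may state the conditions with different constants or over a restricted region.

$$
\begin{aligned}
\text{joint PL:}\quad & \|\nabla \mathsf{L}_\gamma(u,v)\|^2 \ge 2\mu\,\big(\mathsf{L}_\gamma(u,v)-\min_{u',v'}\mathsf{L}_\gamma(u',v')\big),\\
\text{blockwise PL:}\quad & \|\nabla_u \mathsf{L}_\gamma(u,v)\|^2 \ge 2\mu_u\,\big(\mathsf{L}_\gamma(u,v)-\min_{u'}\mathsf{L}_\gamma(u',v)\big),\\
& \|\nabla_v \mathsf{L}_\gamma(u,v)\|^2 \ge 2\mu_v\,\big(\mathsf{L}_\gamma(u,v)-\min_{v'}\mathsf{L}_\gamma(u,v')\big).
\end{aligned}
$$

Under either inequality, every stationary point of the relevant block is a global minimizer of that block, which is why gradient-type updates cannot stall at a spurious local solution.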

Abstract

Bilevel optimization has witnessed a resurgence of interest, driven by its critical role in trustworthy and efficient AI applications. While many recent works have established convergence to stationary points or local minima, obtaining the global optimum of bilevel optimization remains an important yet open problem. The difficulty lies in the fact that, unlike many prior non-convex single-level problems, bilevel problems often do not admit a benign landscape, and may indeed have multiple spurious local solutions. Nevertheless, attaining global optimality is indispensable for ensuring reliability, safety, and cost-effectiveness, particularly in high-stakes engineering applications that rely on bilevel optimization. In this paper, we first explore the challenges of establishing a global convergence theory for bilevel optimization, and present two sufficient conditions for global convergence. We provide algorithm-dependent proofs to rigorously substantiate these sufficient conditions on two specific bilevel learning scenarios: representation learning and data hyper-cleaning (a.k.a. reweighting). Experiments corroborate the theoretical findings, demonstrating convergence to the global minimum in both cases.

Paper Structure (58 sections, 38 theorems, 244 equations, 7 figures, 2 algorithms)


Introduction
Our main results
Related works
Novelty and technical Challenges
Challenges and Target of Convergence
Challenges in the nested formulation of bilevel optimization
Seeking global optimum via penalty reformulation
Global Convergence Condition in Bilevel Optimization
Benign landscape conditions
A globally convergent algorithm: penalty-based bilevel gradient descent
Ensuring Global Convergence Conditions
Global Convergence in Representation Learning
Global Convergence in Data Hyper-cleaning
Numerical Experiments
Conclusions
...and 43 more sections

Key Result

Theorem 1

Suppose Assumption \ref{ass-general} holds. Then $g^*(u)$ is smooth with $L_g:=\ell_{g}(1+\ell_{g}/2\mu_g)$. Given a target accuracy $\epsilon$, we set $\gamma={\cal O}(\epsilon^{-0.5})$, stepsizes $\beta\leq\frac{1}{\ell_g}$, and inner loop $T_k={\cal O}\left(\log\left(\gamma^2{\epsilon}^{-1}\right)\right)$. for any $(u,v)$ with $g(u,v)-g^*(u)\leq\epsilon_\gamma={\cal O}(\epsilon)$. Alternatively, if $\mat
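The parameter roles in the theorem (penalty $\gamma$, inner stepsize $\beta$, inner loop length $T_k$) can be illustrated with a minimal Python sketch of a PBGD-style loop. Everything concrete here is an assumption made for illustration and is not taken from the paper: the scalar toy problem $f(u,v)=\tfrac12(u-1)^2+\tfrac12(v-2)^2$, $g(u,v)=\tfrac12(v-u)^2$, the outer stepsize `alpha`, the fixed inner length `T`, and the helper names `pbgd` and `penalized_minimizer`. The sketch estimates $\nabla g^*(u)$ by $\nabla_u g(u,w)$ at an approximate lower-level minimizer $w$, then takes a Jacobi or Gauss-Seidel gradient step on $\mathsf{L}_\gamma$.

```python
"""Toy PBGD-style loop (illustrative sketch; not the paper's code)."""


def grad_f(u, v):
    # Assumed toy upper level: f(u, v) = 0.5*(u - 1)^2 + 0.5*(v - 2)^2
    return u - 1.0, v - 2.0


def grad_g(u, v):
    # Assumed toy lower level: g(u, v) = 0.5*(v - u)^2, hence g*(u) = 0
    return -(v - u), v - u


def pbgd(gamma, alpha, beta, K, T, gauss_seidel=False):
    """Run K outer steps on L_gamma = f + gamma * (g - g*)."""
    u, v, w = 0.0, 0.0, 0.0
    for _ in range(K):
        # Inner loop: T gradient steps on g(u, .) so that w ~ argmin_v g(u, v)
        for _ in range(T):
            w -= beta * grad_g(u, w)[1]
        # Estimate grad g*(u) by the partial gradient of g in u at (u, w)
        grad_gstar = grad_g(u, w)[0]
        fu, fv = grad_f(u, v)
        gu, gv = grad_g(u, v)
        u_next = u - alpha * (fu + gamma * (gu - grad_gstar))
        if gauss_seidel:
            # Gauss-Seidel: the v-step sees the freshly updated u
            fv = grad_f(u_next, v)[1]
            gv = grad_g(u_next, v)[1]
        # g*(u) does not depend on v, so it drops out of the v-gradient
        v = v - alpha * (fv + gamma * gv)
        u = u_next
    return u, v


def penalized_minimizer(gamma):
    # Closed-form minimizer of L_gamma for the toy problem above
    d = 1.0 / (1.0 + 2.0 * gamma)  # equals v - u at the minimizer
    return 1.0 + gamma * d, 2.0 - gamma * d


if __name__ == "__main__":
    gamma = 10.0
    alpha = 1.0 / (1.0 + 2.0 * gamma)  # 1 / smoothness of L_gamma here
    print(pbgd(gamma, alpha, beta=0.5, K=2000, T=20))
    print(penalized_minimizer(gamma))
```

On this toy problem the bilevel optimum is $u=v=1.5$, and the lower-level gap at the penalized minimizer is $\tfrac12(1+2\gamma)^{-2}$. It decays like $\gamma^{-2}$, which is consistent with choosing $\gamma=\mathcal{O}(\epsilon^{-0.5})$ to obtain an $\mathcal{O}(\epsilon)$ gap.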

Figures (7)

Figure 1: Visualization of $g(u,v)$, $f(u,v)$, $\mathsf{F}(u)$ and $\mathsf{L}_\gamma(u,v)$ in Example \ref{ex1}. In (b), $f(u,v)$ is PL but $\mathsf{F}(u)$ is distorted by $\mathcal{S}(u)$. In (c), saddle points exist for $\mathsf{F}(u)$, suggesting that $\mathsf{F}(u)$ is not PL. In (d), the penalty objective $\mathsf{L}_\gamma(u,v)$ has a better landscape because of the additional dimension $v$.
Figure 2: The landscape of $\mathsf{F}(W_1)$ and $\mathsf{L}_\gamma(W_1,W_2)$ for penalty constants $\gamma=0.1, 1, 50$ in representation learning. The orange terrain is $\mathsf{F}(W_1)$, while the blue surface is $\mathsf{L}_\gamma(W_1,W_2)$. The black line is the trajectory of PBGD, which converges to the global optimum of the bilevel loss.
Figure 3: Relative errors at the upper and lower levels, in $\log$ scale, of PBGD versus iteration $K$ under different $\gamma$, where $\mathsf{L}_{\rm val}^*=\min_{W_1,W_2\in\mathcal{S}(W_1)} \mathsf{L}_{\rm val}(W_1,W_2)$ and $\ell_{\rm val}^*=\min_{u,W\in\mathcal{S}(u)} \ell_{\rm val}(W)$. (a)-(b) are for PBGD \ref{penalty-alg1} in representation learning, and (c)-(d) are for PBGD \ref{penalty-alg1e} in data hyper-cleaning.
Figure 4: Relative errors in $\log$ scale of different methods under different stepsizes in representation learning. (a)-(b) are an ablation study for stepsizes $\alpha,\beta$ in PBGD, and (c)-(d) compare different methods.
Figure 5: Visualization of $g(u,v)$, $f(u,v)$ and $\mathsf{F}(u)$ in Example \ref{ex2}.
...and 2 more figures

Theorems & Definitions (76)

Definition 1: Joint and blockwise PL condition
Example 1
Definition 2: An $(\epsilon_1,\epsilon_2)$ solution of bilevel problem
Theorem 1: Global convergence of PBGD
Remark 1: Other choices of algorithms
Lemma 1: Joint PL condition and descent lemma over trajectory
Theorem 2: Global convergence of PBGD for representation learning
Lemma 2: Blockwise PL condition
Theorem 3: Global convergence for data hyper-cleaning
Lemma 3: Matrix Inequality
...and 66 more
