Unlocking Global Optimality in Bilevel Optimization: A Pilot Study

Quan Xiao, Tianyi Chen

TL;DR

This work studies global optimality in bilevel optimization through the penalized surrogate $\mathsf{L}_\gamma(u,v)=f(u,v)+\gamma\,(g(u,v)-g^*(u))$, which is minimized by a first-order method, penalty-based bilevel gradient descent (PBGD). It introduces two benign landscape conditions on $\mathsf{L}_\gamma$ (joint PL and blockwise PL, where PL denotes the Polyak-Łojasiewicz condition) and proves that, under these conditions, PBGD with Jacobi or Gauss-Seidel updates converges to the global bilevel optimum. For a target accuracy $\epsilon$, the penalty parameter is set to $\gamma=\mathcal{O}(\epsilon^{-0.5})$ and the method needs $\mathcal{O}(\log(1/\epsilon)^2)$ iterations. The two conditions are verified on two representative bilevel learning problems, representation learning and data hyper-cleaning, and numerical experiments show convergence to the global optimum in both. The results offer a principled route to global optimality in structured bilevel problems, with implications for the reliability and safety of AI systems built on bilevel formulations.
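
For orientation, the quantities named in the TL;DR above can be written out as below. This is a sketch in standard bilevel notation: the symbols $\mathsf{F}(u)$ and $\mathcal{S}(u)$ are taken from the figure captions, and the full paper may attach further assumptions (for example, on constraint sets) that are not visible in this summary.

```latex
\begin{aligned}
\min_{u}\ \mathsf{F}(u) &:= \min_{v\in\mathcal{S}(u)} f(u,v),
  \qquad \mathcal{S}(u) := \operatorname*{arg\,min}_{v}\ g(u,v), \\
g^*(u) &:= \min_{v}\ g(u,v), \\
\mathsf{L}_\gamma(u,v) &:= f(u,v) + \gamma\,\bigl(g(u,v) - g^*(u)\bigr).
\end{aligned}
```

Since $g(u,v)-g^*(u)\geq 0$, with equality exactly when $v\in\mathcal{S}(u)$, the penalty term measures how far $v$ is from lower-level optimality, and a larger $\gamma$ enforces the lower-level constraint more strictly.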

Abstract

Bilevel optimization has witnessed a resurgence of interest, driven by its critical role in trustworthy and efficient AI applications. While many recent works have established convergence to stationary points or local minima, obtaining the global optimum of bilevel optimization remains an important yet open problem. The difficulty lies in the fact that, unlike many prior non-convex single-level problems, bilevel problems often do not admit a benign landscape, and may indeed have multiple spurious local solutions. Nevertheless, attaining global optimality is indispensable for ensuring reliability, safety, and cost-effectiveness, particularly in high-stakes engineering applications that rely on bilevel optimization. In this paper, we first explore the challenges of establishing a global convergence theory for bilevel optimization, and present two sufficient conditions for global convergence. We provide algorithm-dependent proofs to rigorously substantiate these sufficient conditions on two specific bilevel learning scenarios: representation learning and data hypercleaning (a.k.a. reweighting). Experiments corroborate the theoretical findings, demonstrating convergence to the global minimum in both cases.

Paper Structure

This paper contains 58 sections, 38 theorems, 244 equations, 7 figures, 2 algorithms.

Key Result

Theorem 1

Suppose Assumption ass-general holds. Then $g^*(u)$ is smooth with constant $L_g:=\ell_{g}(1+\ell_{g}/(2\mu_g))$. Given a target accuracy $\epsilon$, we set $\gamma={\cal O}(\epsilon^{-0.5})$, stepsize $\beta\leq\frac{1}{\ell_g}$, and inner-loop length $T_k={\cal O}\left(\log\left(\gamma^2{\epsilon}^{-1}\right)\right)$. ... for any $(u,v)$ with $g(u,v)-g^*(u)\leq\epsilon_\gamma={\cal O}(\epsilon)$. Alternatively, if ...
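
The roles of $\gamma$, the inner stepsize $\beta$, and the inner-loop length $T_k$ can be illustrated with a minimal sketch of a PBGD-style loop. Everything specific here is an assumption made for illustration and is not taken from the paper: the scalar toy objectives, the function name `pbgd_toy`, the outer stepsize $\alpha=1/(1+2\gamma)$, the warm-started inner variable `w`, and the way the `gauss_seidel` flag orders the updates. For this toy problem the bilevel optimum is $u=v=0$, and the minimizer of $\mathsf{L}_\gamma$ is $u=1/(1+2\gamma)$, $v=-1/(1+2\gamma)$, so the lower-level gap shrinks like $1/\gamma^2$.

```python
def f_u(u, v):
    # d/du of the toy upper-level loss f(u, v) = 0.5*(u - 1)^2 + 0.5*(v + 1)^2
    return u - 1.0


def f_v(u, v):
    # d/dv of the same toy upper-level loss
    return v + 1.0


def g_u(u, v):
    # d/du of the toy lower-level loss g(u, v) = 0.5*(v - u)^2
    return u - v


def g_v(u, v):
    # d/dv of the toy lower-level loss; g is strongly convex (hence PL) in v
    return v - u


def pbgd_toy(gamma=50.0, beta=0.5, outer_iters=3000, inner_iters=20,
             gauss_seidel=False):
    """Penalty-based bilevel gradient descent on a hypothetical scalar problem.

    Here S(u) = {u} and g*(u) = 0, so the penalized objective is
    L_gamma(u, v) = f(u, v) + gamma * g(u, v).
    """
    alpha = 1.0 / (1.0 + 2.0 * gamma)  # assumed stepsize: 1 / smoothness of L_gamma
    u, v, w = 2.0, -2.0, 0.0           # w tracks an approximate lower-level solution
    for _ in range(outer_iters):
        # Inner loop: gradient descent on g(u, .) to approximate v*(u).
        for _ in range(inner_iters):
            w -= beta * g_v(u, w)
        # grad g*(u) is approximated by grad_u g(u, w) at the inner iterate w.
        grad_u = f_u(u, v) + gamma * (g_u(u, v) - g_u(u, w))
        u_next = u - alpha * grad_u
        # Jacobi uses the old u in the v-update; Gauss-Seidel uses the new u.
        u_ref = u_next if gauss_seidel else u
        grad_v = f_v(u_ref, v) + gamma * g_v(u_ref, v)
        v -= alpha * grad_v
        u = u_next
    return u, v


if __name__ == "__main__":
    for flag in (False, True):
        u, v = pbgd_toy(gauss_seidel=flag)
        gap = 0.5 * (v - u) ** 2  # lower-level suboptimality g(u, v) - g*(u)
        print(f"gauss_seidel={flag}: u={u:.6f}, v={v:.6f}, gap={gap:.2e}")
```

Increasing `gamma` moves the returned point closer to the bilevel optimum at the cost of a smaller outer stepsize, which is the trade-off the theorem's choice $\gamma={\cal O}(\epsilon^{-0.5})$ balances.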

Figures (7)

  • Figure 1: Visualization of $g(u,v), f(u,v), \mathsf{F}(u)$ and $\mathsf{L}_\gamma (u,v)$ in Example 1. In (b), $f(u,v)$ is PL but $\mathsf{F}(u)$ is distorted by $\mathcal{S}(u)$. In (c), saddle points exist for $\mathsf{F}(u)$, suggesting that $\mathsf{F}(u)$ is not PL. In (d), the penalty objective $\mathsf{L}_\gamma(u,v)$ has a better landscape because of the additional dimension $v$.
  • Figure 2: The landscapes of $\mathsf{F}(W_1)$ and $\mathsf{L}_\gamma(W_1,W_2)$ for penalty constants $\gamma=0.1,1,50$ in representation learning. The orange terrain is $\mathsf{F}(W_1)$, while the blue surface is $\mathsf{L}_\gamma(W_1,W_2)$. The black line is the trajectory of PBGD, which converges to the global optimum of the bilevel loss.
  • Figure 3: Upper-level and lower-level relative errors of PBGD (in $\log$ scale) versus iteration $K$ for different $\gamma$, where $\mathsf{L}_{\rm val}^*=\min_{W_1,W_2\in\mathcal{S}(W_1)} \mathsf{L}_{\rm val}(W_1,W_2)$ and $\ell_{\rm val}^*=\min_{u,W\in\mathcal{S}(u)} \ell_{\rm val}(W)$. (a)--(b) show the PBGD variant used for representation learning, and (c)--(d) show the variant used for data hyper-cleaning.
  • Figure 4: Relative errors (in $\log$ scale) of different methods under different stepsizes in representation learning. (a)--(b) are an ablation study over the stepsizes $\alpha,\beta$ in PBGD, and (c)--(d) compare different methods.
  • Figure 5: Visualization of $g(u,v), f(u,v)$ and $\mathsf{F}(u)$ in Example 2.
  • ...and 2 more figures

Theorems & Definitions (76)

  • Definition 1: Joint and blockwise PL condition
  • Example 1
  • Definition 2: An $(\epsilon_1,\epsilon_2)$ solution of bilevel problem
  • Theorem 1: Global convergence of PBGD
  • Remark 1: Other choices of algorithms
  • Lemma 1: Joint PL condition and descent lemma over trajectory
  • Theorem 2: Global convergence of PBGD for representation learning
  • Lemma 2: Blockwise PL condition
  • Theorem 3: Global convergence for data hyper-cleaning
  • Lemma 3: Matrix Inequality
  • ...and 66 more
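
As a reading aid for Definition 1, the textbook form of the Polyak-Łojasiewicz (PL) condition and its joint and blockwise variants for a function $h(u,v)$ can be written as follows. This is the standard formulation, with the constants $\mu$, $\mu_u$, $\mu_v$ and the generic function $h$ introduced here as assumptions; the paper's own definition may differ in constants or in the set on which the inequalities are required to hold.

```latex
\begin{aligned}
\text{Joint PL:}\quad
  &\|\nabla h(u,v)\|^2 \;\geq\; 2\mu\,\Bigl(h(u,v) - \min_{u',v'} h(u',v')\Bigr), \\
\text{Blockwise PL:}\quad
  &\|\nabla_u h(u,v)\|^2 \;\geq\; 2\mu_u\,\Bigl(h(u,v) - \min_{u'} h(u',v)\Bigr), \\
  &\|\nabla_v h(u,v)\|^2 \;\geq\; 2\mu_v\,\Bigl(h(u,v) - \min_{v'} h(u,v')\Bigr).
\end{aligned}
```

The PL condition matters for global convergence because, for an $L$-smooth function $h$ satisfying it with constant $\mu$, gradient descent with stepsize $1/L$ contracts the optimality gap linearly, $h(x_{k+1})-h^*\leq(1-\mu/L)\,(h(x_k)-h^*)$, without requiring convexity.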