Unlocking Global Optimality in Bilevel Optimization: A Pilot Study
Quan Xiao, Tianyi Chen
TL;DR
This work studies global optimality in bilevel optimization through the penalized surrogate $\mathsf{L}_\gamma(u,v)=f(u,v)+\gamma\,(g(u,v)-g^*(u))$, where $f$ and $g$ are the upper- and lower-level objectives and $g^*(u)=\min_v g(u,v)$, and establishes global convergence of a first-order algorithm, penalty-based bilevel gradient descent (PBGD). It introduces two benign landscape conditions for $\mathsf{L}_\gamma$, a joint Polyak-Łojasiewicz (PL) condition and a blockwise PL condition, and proves that under either condition PBGD converges to the global bilevel optimum with Jacobi or Gauss-Seidel updates. To reach accuracy $\epsilon$, the penalty parameter is set to $\gamma=\mathcal{O}(\epsilon^{-0.5})$ and the algorithm needs $\mathcal{O}(\log(1/\epsilon)^2)$ iterations. The framework is instantiated on two representative bilevel learning problems, representation learning and data hyper-cleaning, where the penalized landscape is shown to be benign and numerical experiments confirm that the algorithms reach globally optimal solutions. Together, the results offer a principled path to global optimality in structured bilevel problems, which matters for the reliability and safety of AI systems built on bilevel formulations.
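To make the penalized objective concrete, the following is a minimal sketch of a PBGD-style loop with Jacobi (simultaneous) updates on a scalar toy problem. The toy objectives, step sizes, iteration counts, and the function name `pbgd` are assumptions chosen for illustration and do not come from the paper. The sketch also assumes that $\nabla g^*(u)$ can be approximated by $\nabla_u g(u,w)$, where $w$ is an approximate lower-level minimizer obtained by a few inner gradient steps.

```python
# Hypothetical toy bilevel problem (not from the paper):
#   upper level: f(u, v) = 0.5*(v - 1)^2 + 0.5*u^2
#   lower level: g(u, v) = 0.5*(v - u)^2, so v*(u) = u and g*(u) = 0
# The bilevel optimum of this toy problem is u = v = 0.5.

def grad_f(u, v):
    """Return (df/du, df/dv) for the toy upper-level objective."""
    return u, v - 1.0

def grad_g(u, v):
    """Return (dg/du, dg/dv) for the toy lower-level objective."""
    return u - v, v - u

def pbgd(gamma=100.0, outer_steps=5000, inner_steps=10, beta=0.5):
    """Gradient descent on L_gamma(u, v) = f + gamma*(g - g*) with Jacobi updates."""
    u, v = 2.0, 0.0
    w = v  # running estimate of the lower-level solution v*(u)
    # On this toy problem L_gamma is (1 + 2*gamma)-smooth, so use step 1/(1 + 2*gamma).
    alpha = 1.0 / (1.0 + 2.0 * gamma)
    for _ in range(outer_steps):
        # Inner loop: a few gradient steps on g(u, .) to track v*(u).
        for _ in range(inner_steps):
            w -= beta * grad_g(u, w)[1]
        fu, fv = grad_f(u, v)
        gu, gv = grad_g(u, v)
        gu_star = grad_g(u, w)[0]  # assumed approximation of d g*(u) / du
        du = fu + gamma * (gu - gu_star)
        dv = fv + gamma * gv
        # Jacobi update: both blocks use gradients evaluated at the old (u, v).
        u, v = u - alpha * du, v - alpha * dv
    return u, v

u, v = pbgd()
print(u, v)
```

On this toy problem the minimizer of $\mathsf{L}_\gamma$ is $u=\gamma/(1+2\gamma)$, $v=(1+\gamma)/(1+2\gamma)$, so the iterates should settle within $\mathcal{O}(1/\gamma)$ of the bilevel optimum $u=v=0.5$. A Gauss-Seidel variant would instead update one block first and use the refreshed value when computing the gradient for the other block.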
Abstract
Bilevel optimization has witnessed a resurgence of interest, driven by its critical role in trustworthy and efficient AI applications. While many recent works have established convergence to stationary points or local minima, obtaining the global optimum of bilevel optimization remains an important yet open problem. The difficulty lies in the fact that, unlike many previously studied non-convex single-level problems, bilevel problems often do not admit a benign landscape and may have multiple spurious local solutions. Nevertheless, attaining global optimality is indispensable for ensuring reliability, safety, and cost-effectiveness, particularly in high-stakes engineering applications that rely on bilevel optimization. In this paper, we first explore the challenges of establishing a global convergence theory for bilevel optimization, and present two sufficient conditions for global convergence. We provide algorithm-dependent proofs to rigorously substantiate these sufficient conditions in two specific bilevel learning scenarios: representation learning and data hyper-cleaning (a.k.a. reweighting). Experiments corroborate the theoretical findings, demonstrating convergence to the global minimum in both cases.
