On Penalty Methods for Nonconvex Bilevel Optimization and First-Order Stochastic Approximation
Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, Robert Nowak
TL;DR
This work tackles nonconvex bilevel optimization (BO) with potentially constrained lower-level (LL) problems by introducing a penalty-based reformulation that combines the upper- and lower-level objectives via $h_{σ}(x,y)=σf(x,y)+g(x,y)$ and a scaled hyper-objective $ψ_{σ}(x)=(l(x,σ)-l(x,0))/σ$, where $l(x,σ)$ is the minimum of $h_{σ}(x,y)$ over feasible $y$. Under a proximal error-bound (EB) condition, the authors prove that $ψ_{σ}$ closely approximates the original hyper-objective $ψ$ in both function value and gradient, with $|ψ_{σ}(x)-ψ(x)|=O(σ/μ)$ and $∥∇ψ_{σ}(x)-∇ψ(x)∥=O(σ/μ^{3})$, and provide an explicit gradient formula even when the lower level has multiple solutions. They develop two first-order schemes: a double-loop method with large batches and a fully single-loop, momentum-assisted method. Both achieve non-asymptotic convergence to $ε$-stationary points of the penalized problem, with oracle complexity $O(ε^{-3})$ in the deterministic setting and $O(ε^{-7})$ in the stochastic setting; under mean-squared smoothness of the stochastic oracles, the single-loop variant improves the stochastic complexity to $O(ε^{-5})$. The framework relies on the proximal-EB landscape regularity to link the penalized problem to the original BO problem, and it supports extensions to constrained or PL-like LL scenarios, scalable stochastic optimization, and potentially nonsmooth objectives. Overall, the paper advances practical first-order methods for broad classes of nonconvex bilevel problems with provable non-asymptotic guarantees.
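As a hedged illustration of the $O(σ/μ)$ value bound, consider a toy unconstrained instance that is an assumption of this note and does not appear in the paper: $f(x,y)=\tfrac{1}{2}(y-1)^{2}+\tfrac{1}{2}x^{2}$ and $g(x,y)=\tfrac{μ}{2}(y-x)^{2}$, with $l(x,σ)=\min_{y}h_{σ}(x,y)$. Here the lower-level solution is unique and everything is available in closed form:

$$
\begin{aligned}
y^{*}(x) &= x, \qquad ψ(x) = \tfrac{1}{2}(x-1)^{2} + \tfrac{1}{2}x^{2},\\
y_{σ}(x) &= \arg\min_{y} h_{σ}(x,y) = \frac{σ + μx}{σ+μ},\\
l(x,σ) &= \frac{σ}{2}x^{2} + \frac{σμ}{2(σ+μ)}(x-1)^{2}, \qquad l(x,0)=0,\\
ψ_{σ}(x) &= \tfrac{1}{2}x^{2} + \frac{μ}{2(σ+μ)}(x-1)^{2},\\
ψ_{σ}(x)-ψ(x) &= -\frac{σ}{2(σ+μ)}(x-1)^{2}, \qquad ∇ψ_{σ}(x)-∇ψ(x) = -\frac{σ}{σ+μ}(x-1).
\end{aligned}
$$

On any bounded set the value gap is therefore at most of order $σ/μ$, and the gradient gap also vanishes linearly in $σ$. This instance is too simple to exhibit the $μ^{-3}$ dependence of the general gradient bound.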
Abstract
In this work, we study first-order algorithms for solving Bilevel Optimization (BO) where the objective functions are smooth but possibly nonconvex in both levels and the variables are restricted to closed convex sets. As a first step, we study the landscape of BO through the lens of penalty methods, in which the upper- and lower-level objectives are combined in a weighted sum with penalty parameter $σ > 0$. In particular, we establish a strong connection between the penalty function and the hyper-objective by explicitly characterizing the conditions under which the values and derivatives of the two must be $O(σ)$-close. A by-product of our analysis is an explicit formula for the gradient of the hyper-objective when the lower-level problem has multiple solutions under minimal conditions, which could be of independent interest. Next, viewing the penalty formulation as an $O(σ)$-approximation of the original BO, we propose first-order algorithms that find an $ε$-stationary solution by optimizing the penalty formulation with $σ = O(ε)$. When the perturbed lower-level problem uniformly satisfies the small-error proximal error-bound (EB) condition, we propose a first-order algorithm that converges to an $ε$-stationary point of the penalty function, using in total $O(ε^{-3})$ and $O(ε^{-7})$ accesses to first-order (stochastic) gradient oracles when the oracles are deterministic and noisy, respectively. Under an additional assumption on stochastic oracles, we show that the algorithm can be implemented in a fully single-loop manner, i.e., with $O(1)$ samples per iteration, and achieves the improved oracle complexity of $O(ε^{-3})$ and $O(ε^{-5})$, respectively.
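The idea of optimizing the penalty formulation with small $σ$ can be sketched numerically. The Python snippet below is a deterministic, unconstrained double-loop toy, not the paper's stochastic algorithm; the test functions, step sizes, iteration counts, and helper names (`inner_solve`, `penalty_grad`, `double_loop`) are all assumptions made for illustration. It relies on the identity $∇ψ_{σ}(x)=∇_{x}f(x,y_{σ})+(∇_{x}g(x,y_{σ})-∇_{x}g(x,y_{0}))/σ$, which follows from differentiating $(l(x,σ)-l(x,0))/σ$ when the inner minimizers $y_{σ}$ and $y_{0}$ are unique.

```python
# Toy instance (an assumption for illustration, not taken from the paper):
#   f(x, y) = 0.5*(y - 1)^2 + 0.5*x^2    upper-level objective
#   g(x, y) = 0.5*MU*(y - x)^2           lower-level objective
MU = 1.0


def grad_f(x, y):
    """Return (df/dx, df/dy) for the toy upper-level objective."""
    return x, y - 1.0


def grad_g(x, y):
    """Return (dg/dx, dg/dy) for the toy lower-level objective."""
    return -MU * (y - x), MU * (y - x)


def inner_solve(x, sigma, y_init, steps=200, lr=0.5):
    """Approximately minimize h_sigma(x, .) = sigma*f + g by gradient descent."""
    y = y_init
    for _ in range(steps):
        _, fy = grad_f(x, y)
        _, gy = grad_g(x, y)
        y -= lr * (sigma * fy + gy)
    return y


def penalty_grad(x, sigma, y_sig, y_zero):
    """Estimate the gradient of psi_sigma at x from two inner solutions."""
    y_sig = inner_solve(x, sigma, y_sig)    # minimizer of sigma*f + g
    y_zero = inner_solve(x, 0.0, y_zero)    # minimizer of g alone
    fx, _ = grad_f(x, y_sig)
    gx_sig, _ = grad_g(x, y_sig)
    gx_zero, _ = grad_g(x, y_zero)
    return fx + (gx_sig - gx_zero) / sigma, y_sig, y_zero


def double_loop(x_init, sigma, outer_steps=200, lr=0.2):
    """Outer gradient descent on psi_sigma with warm-started inner solves."""
    x, y_sig, y_zero = x_init, 0.0, 0.0
    for _ in range(outer_steps):
        d, y_sig, y_zero = penalty_grad(x, sigma, y_sig, y_zero)
        x -= lr * d
    return x


if __name__ == "__main__":
    print(double_loop(2.0, sigma=0.01))
```

On this instance the stationary point of $ψ_{σ}$ is $μ/(σ+2μ)$, which lies within $O(σ)$ of the bilevel solution $x=1/2$, so shrinking $σ$ tightens the approximation at the cost of a more ill-conditioned difference quotient.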
