On Penalty Methods for Nonconvex Bilevel Optimization and First-Order Stochastic Approximation

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, Robert Nowak

TL;DR

This work tackles nonconvex bilevel optimization (BO) with possibly constrained lower-level (LL) problems through a penalty reformulation that combines the upper- and lower-level objectives as $h_{σ}(x,y)=σf(x,y)+g(x,y)$ and defines a scaled hyper-objective $ψ_{σ}(x)=(l(x,σ)-l(x,0))/σ$. Under a proximal error-bound (EB) condition, the authors prove that $ψ_{σ}$ approximates the original hyper-objective $ψ$ in both function value and gradient, with $|ψ_{σ}(x)-ψ(x)|=O(σ/μ)$ and $∥∇ψ_{σ}(x)-∇ψ(x)∥=O(σ/μ^{3})$, and they give an explicit gradient formula that holds even when the lower level has multiple solutions. They develop two first-order schemes: a double-loop method with large batches and a fully single-loop, momentum-assisted method. Both converge non-asymptotically to $ε$-stationary points of the penalized problem, with oracle complexity $O(ε^{-3})$ for deterministic oracles and $O(ε^{-7})$ for noisy oracles; the single-loop variant improves the stochastic complexity to $O(ε^{-5})$ under mean-squared smoothness. The analysis relies on proximal-EB regularity of the landscape to link the penalized problem to the original BO, and it extends to constrained and PL-like LL settings and to scalable stochastic optimization, with possible extensions to nonsmooth objectives. Overall, the paper provides practical first-order methods with non-asymptotic guarantees for a broad class of nonconvex bilevel problems.
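
To make the penalty construction concrete, here is a minimal sketch on a toy problem. The toy objectives `f` and `g`, the helper names, and the reading of $l(x,σ)$ as the minimum of $h_{σ}(x,\cdot)$ over $y$ are all assumptions made for illustration; they are not taken from the paper. The lower level is unconstrained and strongly convex, so its minimizer is available in closed form and the gap $|ψ_{σ}(x)-ψ(x)|$ can be checked directly.

```python
# Toy instance (an assumption for illustration, not from the paper):
#   upper level  f(x, y) = 0.5*(y - 1)^2 + 0.5*x^2
#   lower level  g(x, y) = 0.5*(y - x)^2   (unconstrained, strongly convex in y)

def f(x, y):
    return 0.5 * (y - 1.0) ** 2 + 0.5 * x ** 2

def g(x, y):
    return 0.5 * (y - x) ** 2

def argmin_h(x, sigma):
    """Minimizer over y of h_sigma(x, y) = sigma*f(x, y) + g(x, y).

    Solves the first-order condition sigma*(y - 1) + (y - x) = 0.
    """
    return (x + sigma) / (1.0 + sigma)

def l(x, sigma):
    """Assumed meaning of l(x, sigma): the minimum of h_sigma(x, .) over y."""
    y = argmin_h(x, sigma)
    return sigma * f(x, y) + g(x, y)

def psi(x):
    """Original hyper-objective: f evaluated at the lower-level solution."""
    return f(x, argmin_h(x, 0.0))

def psi_sigma(x, sigma):
    """Scaled penalty hyper-objective (l(x, sigma) - l(x, 0)) / sigma."""
    return (l(x, sigma) - l(x, 0.0)) / sigma

x = 3.0
gaps = {sigma: abs(psi_sigma(x, sigma) - psi(x)) for sigma in (0.1, 0.05, 0.025)}
for sigma, gap in gaps.items():
    # The ratio gap/sigma should stay bounded as sigma shrinks.
    print(f"sigma={sigma:<6} gap={gap:.5f} gap/sigma={gap / sigma:.4f}")
```

On this toy problem the gap works out to $0.5(x-1)^{2}σ/(1+σ)$, so it shrinks linearly in $σ$, which is consistent with the $O(σ/μ)$ value bound quoted above (here the lower level has strong-convexity constant 1).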

Abstract

In this work, we study first-order algorithms for solving Bilevel Optimization (BO) where the objective functions are smooth but possibly nonconvex in both levels and the variables are restricted to closed convex sets. As a first step, we study the landscape of BO through the lens of penalty methods, in which the upper- and lower-level objectives are combined in a weighted sum with penalty parameter $σ > 0$. In particular, we establish a strong connection between the penalty function and the hyper-objective by explicitly characterizing the conditions under which the values and derivatives of the two must be $O(σ)$-close. A by-product of our analysis is the explicit formula for the gradient of the hyper-objective when the lower-level problem has multiple solutions under minimal conditions, which could be of independent interest. Next, viewing the penalty formulation as an $O(σ)$-approximation of the original BO, we propose first-order algorithms that find an $ε$-stationary solution by optimizing the penalty formulation with $σ = O(ε)$. When the perturbed lower-level problem uniformly satisfies the small-error proximal error-bound (EB) condition, we propose a first-order algorithm that converges to an $ε$-stationary point of the penalty function, using in total $O(ε^{-3})$ and $O(ε^{-7})$ accesses to first-order (stochastic) gradient oracles when the oracle is deterministic and when the oracles are noisy, respectively. Under an additional assumption on stochastic oracles, we show that the algorithm can be implemented in a fully single-loop manner, i.e., with $O(1)$ samples per iteration, and achieves the improved oracle-complexity of $O(ε^{-3})$ and $O(ε^{-5})$, respectively.
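
The idea of optimizing the penalty formulation with first-order information only can be sketched as a deterministic double-loop iteration. This is a simplified illustration, not the authors' algorithm: it uses a hypothetical toy problem, exact gradients, no projection onto constraint sets, and hand-picked step sizes `alpha` and `beta`. It assumes the lower-level minimizers are unique, so that by the envelope theorem the penalty gradient is $∇_x f(x,y_σ) + (∇_x g(x,y_σ) - ∇_x g(x,z))/σ$, where $y_σ$ minimizes $σf+g$ and $z$ minimizes $g$ over the lower-level variable.

```python
# Toy problem (assumed for illustration, not from the paper):
#   f(x, y) = 0.5*(y - 1)^2 + 0.5*x^2,   g(x, y) = 0.5*(y - x)^2, no constraints.

def grad_f(x, y):
    """Return (df/dx, df/dy) for the toy upper-level objective."""
    return x, y - 1.0

def grad_g(x, y):
    """Return (dg/dx, dg/dy) for the toy lower-level objective."""
    return x - y, y - x

def penalty_double_loop(x0, sigma, outer_steps=200, inner_steps=50,
                        alpha=0.2, beta=0.5):
    """Gradient descent on psi_sigma using only first-order information."""
    x, y, z = x0, x0, x0  # y tracks argmin of sigma*f + g, z tracks argmin of g
    for _ in range(outer_steps):
        # Inner loop: warm-started gradient descent on both lower-level problems.
        for _ in range(inner_steps):
            y -= beta * (sigma * grad_f(x, y)[1] + grad_g(x, y)[1])
            z -= beta * grad_g(x, z)[1]
        # Outer step: envelope-theorem estimate of the penalty gradient in x.
        grad_x = grad_f(x, y)[0] + (grad_g(x, y)[0] - grad_g(x, z)[0]) / sigma
        x -= alpha * grad_x
    return x

sigma = 0.01
x_hat = penalty_double_loop(x0=3.0, sigma=sigma)
print(f"x_hat={x_hat:.6f}, stationary point of psi_sigma={1.0 / (2.0 + sigma):.6f}")
```

For this toy problem $ψ_{σ}$ is minimized at $x = 1/(2+σ)$, while the original hyper-objective is minimized at $x = 1/2$, so the returned point is within $O(σ)$ of the true solution. The difference of lower-level gradients is divided by $σ$, so inner-loop errors are amplified by $1/σ$; this is one reason the inner problems must be solved more accurately as $σ$ shrinks.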

Paper Structure (76 sections, 28 theorems, 216 equations, 1 figure, 2 algorithms)

Key Result

Theorem 1.1

Under the small-gradient proximal error-bound assumption (with additional smoothness assumptions), for all $x \in \mathcal{X}$ such that at least one sufficiently regular solution path $y^*(x, \sigma)$ exists for $\sigma \in [0,\sigma_0]$, we have $|\psi_{\sigma}(x)-\psi(x)|=O(\sigma/\mu)$ and $\|\nabla\psi_{\sigma}(x)-\nabla\psi(x)\|=O(\sigma/\mu^{3})$.

Figures (1)

  • Figure 1: $\psi(x)$ and $\psi_{\sigma}(x)$ in the examples: (left) the example labeled bilinear_LL, (right) the example labeled sc_non_smooth. Blue dashed lines compare $\psi_{\sigma}(x)$ to the original hyper-objective $\psi(x)$.

Theorems & Definitions (42)

  • Definition 1
  • Theorem 1.1: Informal
  • Definition 2: Hausdorff Distance
  • Definition 3: Lipschitz Continuity of Solution Sets [chen2023bilevel]
  • Definition 4: Active Constraints
  • Definition 5: Linear Independence Constraint Qualification (LICQ)
  • Definition 6: Strict Complementarity
  • Example 1
  • Example 2
  • Definition 7
  • ...and 32 more