On Penalty Methods for Nonconvex Bilevel Optimization and First-Order Stochastic Approximation

Jeongyeol Kwon; Dohyun Kwon; Stephen Wright; Robert Nowak

TL;DR

This work tackles nonconvex bilevel optimization (BO) with potentially constrained lower-level (LL) problems by introducing a penalty-based reformulation that blends the upper- and lower-level objectives via $h_{σ}(x,y)=σf(x,y)+g(x,y)$ and a scaled hyper-objective $ψ_{σ}(x)=(l(x,σ)-l(x,0))/σ$, where $l(x,σ)$ denotes the minimum of $h_{σ}(x,y)$ over feasible $y$. Under a proximal error-bound (EB) condition, the authors prove that $ψ_{σ}$ closely approximates the original hyper-objective $ψ$ in both function value and gradient, with $|ψ_{σ}(x)-ψ(x)|=O(σ/μ)$ and $∥∇ψ_{σ}(x)-∇ψ(x)∥=O(σ/μ^{3})$, and provide an explicit gradient formula even when the lower level has multiple solutions. They develop two first-order schemes: a double-loop method with large batches and a fully single-loop, momentum-assisted method. Both converge non-asymptotically to $ε$-stationary points of the penalized problem: the oracle complexity is $O(ε^{-3})$ with deterministic gradients and $O(ε^{-7})$ with stochastic gradients, and the single-loop variant improves the stochastic complexity to $O(ε^{-5})$ under mean-squared smoothness. The proximal-EB condition is the landscape regularity that links the penalized problem to the original BO, and the framework covers constrained and PL-like LL problems and scalable stochastic optimization, with possible extensions to nonsmooth objectives. Overall, the paper advances practical, first-order methods for broad classes of nonconvex bilevel problems with provable non-asymptotic guarantees.
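The definitions of $h_{σ}$ and $ψ_{σ}$ can be made concrete with a minimal numerical sketch. Everything below is an illustrative assumption rather than material from the paper: the toy objectives `f` and `g`, the constant `MU`, the closed-form inner minimizer, and the reading of $l(x,σ)$ as the minimum of $h_{σ}(x,y)$ over $y$ (unconstrained here).

```python
# Toy bilevel instance (hypothetical, chosen so the inner problem has a closed form):
#   upper level: f(x, y) = 0.5*(y - 1)^2 + 0.5*x^2
#   lower level: g(x, y) = 0.5*MU*(y - x)^2, strongly convex in y
MU = 2.0


def f(x, y):
    return 0.5 * (y - 1.0) ** 2 + 0.5 * x ** 2


def g(x, y):
    return 0.5 * MU * (y - x) ** 2


def argmin_h(x, sigma):
    """Minimizer over y of h_sigma(x, y) = sigma*f(x, y) + g(x, y) for this toy."""
    # Stationarity: sigma*(y - 1) + MU*(y - x) = 0
    return (sigma + MU * x) / (sigma + MU)


def l(x, sigma):
    """Assumed value function: min over y of h_sigma(x, y)."""
    y = argmin_h(x, sigma)
    return sigma * f(x, y) + g(x, y)


def psi(x):
    """Original hyper-objective: f evaluated at the lower-level solution y = x."""
    return f(x, argmin_h(x, 0.0))


def psi_sigma(x, sigma):
    """Scaled penalty hyper-objective (l(x, sigma) - l(x, 0)) / sigma."""
    return (l(x, sigma) - l(x, 0.0)) / sigma


def value_gap(x, sigma):
    return abs(psi_sigma(x, sigma) - psi(x))


if __name__ == "__main__":
    for sigma in (1.0, 0.1, 0.01):
        print(sigma, value_gap(-1.0, sigma))
```

On this instance the gap shrinks roughly in proportion to σ, which is the behavior the $O(σ/μ)$ value bound describes. A single benign quadratic says nothing about the general nonconvex or constrained case.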

Abstract

In this work, we study first-order algorithms for solving Bilevel Optimization (BO) where the objective functions are smooth but possibly nonconvex in both levels and the variables are restricted to closed convex sets. As a first step, we study the landscape of BO through the lens of penalty methods, in which the upper- and lower-level objectives are combined in a weighted sum with penalty parameter $σ > 0$. In particular, we establish a strong connection between the penalty function and the hyper-objective by explicitly characterizing the conditions under which the values and derivatives of the two must be $O(σ)$-close. A by-product of our analysis is the explicit formula for the gradient of the hyper-objective when the lower-level problem has multiple solutions under minimal conditions, which could be of independent interest. Next, viewing the penalty formulation as an $O(σ)$-approximation of the original BO, we propose first-order algorithms that find an $ε$-stationary solution by optimizing the penalty formulation with $σ = O(ε)$. When the perturbed lower-level problem uniformly satisfies the small-error proximal error-bound (EB) condition, we propose a first-order algorithm that converges to an $ε$-stationary point of the penalty function, using in total $O(ε^{-3})$ and $O(ε^{-7})$ accesses to first-order (stochastic) gradient oracles when the oracle is deterministic and when oracles are noisy, respectively. Under an additional assumption on stochastic oracles, we show that the algorithm can be implemented in a fully single-loop manner, i.e., with $O(1)$ samples per iteration, and achieves the improved oracle complexity of $O(ε^{-3})$ and $O(ε^{-5})$, respectively.
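The idea of optimizing the penalty formulation with first-order information can be sketched as a deterministic double-loop iteration on a toy problem. This is a hedged illustration, not the paper's algorithm. The toy gradients, the step sizes `alpha` and `beta`, the loop counts, and the constants `MU` and `SIGMA` are all assumptions. The outer gradient estimate $∇_x f(x,y) + (∇_x g(x,y) - ∇_x g(x,z))/σ$ is what the envelope theorem gives on this unconstrained toy, with `y` tracking the minimizer of $h_{σ}(x,\cdot)$ and `z` the minimizer of $g(x,\cdot)$. Projections onto the feasible sets are omitted because the toy is unconstrained.

```python
# Hypothetical toy: f(x, y) = 0.5*(y - 1)^2 + 0.5*x^2, g(x, y) = 0.5*MU*(y - x)^2
MU, SIGMA = 1.0, 0.1


def grad_x_f(x, y):
    return x


def grad_y_f(x, y):
    return y - 1.0


def grad_x_g(x, y):
    return -MU * (y - x)


def grad_y_g(x, y):
    return MU * (y - x)


def penalty_descent(x0, outer_steps=200, inner_steps=50, alpha=0.3, beta=0.5):
    """Double-loop gradient descent on the penalized hyper-objective (sketch)."""
    x, y, z = x0, x0, x0
    for _ in range(outer_steps):
        # Inner loop: y follows argmin_y sigma*f + g, z follows argmin_y g.
        for _ in range(inner_steps):
            y -= beta * (SIGMA * grad_y_f(x, y) + grad_y_g(x, y))
            z -= beta * grad_y_g(x, z)
        # Outer step: first-order estimate of the penalized hyper-gradient.
        grad_x = grad_x_f(x, y) + (grad_x_g(x, y) - grad_x_g(x, z)) / SIGMA
        x -= alpha * grad_x
    return x, y, z


if __name__ == "__main__":
    print(penalty_descent(3.0))
```

For this toy the penalized hyper-objective is minimized at $x = μ/(2μ+σ)$, so the iterate can be compared against a known target. The stochastic and single-loop variants described in the abstract would replace the exact gradients with sampled ones and interleave the updates, which this sketch does not attempt.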

Paper Structure

This paper contains 76 sections, 28 theorems, 216 equations, 1 figure, and 2 algorithms.

Introduction
Overview of Main Results
Landscape Analysis.
Algorithm.
Related Work
Implicit-Gradient Descent
Nonconvex Lower-Level Objectives
Penalty Methods
Implicit Differentiation Methods
Preliminaries
Constrained Optimization
Other Notation
Landscape Analysis and Penalty Method
Sufficient Conditions for Differentiability
Asymptotic Landscape
...and 61 more sections

Key Result

Theorem 1.1

Under the small-gradient proximal error-bound assumption (with additional smoothness assumptions), for all $x \in \mathcal{X}$ such that at least one sufficiently regular solution path $y^*(x, \sigma)$ exists for $\sigma \in [0,\sigma_0]$, we have $|\psi_{\sigma}(x) - \psi(x)| = O(\sigma/\mu)$ and $\|\nabla\psi_{\sigma}(x) - \nabla\psi(x)\| = O(\sigma/\mu^{3})$.
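As a sanity check on the kind of closeness this result describes, consider a hypothetical unconstrained quadratic instance. It is not taken from the paper, and it assumes $\psi_{\sigma}(x) = (l(x,\sigma) - l(x,0))/\sigma$ with $l(x,\sigma) = \min_y \, \sigma f(x,y) + g(x,y)$. Here the solution path $y^*(x,\sigma)$ is unique and smooth, and both gaps can be computed exactly:

```latex
\begin{align*}
f(x,y) &= \tfrac{1}{2}(y-1)^2 + \tfrac{1}{2}x^2, \qquad
g(x,y) = \tfrac{\mu}{2}(y-x)^2, \\
y^*(x,\sigma) &= \frac{\sigma + \mu x}{\sigma + \mu}, \qquad
\psi_{\sigma}(x) = \frac{\mu}{2(\sigma+\mu)}(x-1)^2 + \tfrac{1}{2}x^2, \\
\psi(x) - \psi_{\sigma}(x) &= \frac{\sigma}{2(\sigma+\mu)}(x-1)^2
  \;\le\; \frac{\sigma}{2\mu}(x-1)^2, \\
\nabla\psi(x) - \nabla\psi_{\sigma}(x) &= \frac{\sigma}{\sigma+\mu}(x-1).
\end{align*}
```

Both gaps vanish linearly in $\sigma$ on bounded sets, in line with the stated rates. This benign instance does not exhibit the worst-case dependence on $\mu$ that the general bound allows for.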

Figures (1)

Figure 1: $\psi(x)$ and $\psi_{\sigma}(x)$ in Examples: (left) the example labeled `example:bilinear_LL`, (right) the example labeled `example:sc_non_smooth`. Blue dashed lines compare $\psi_{\sigma}(x)$ to the original hyper-objective $\psi(x)$.

Theorems & Definitions (42)

Definition 1
Theorem 1.1: Informal
Definition 2: Hausdorff Distance
Definition 3: Lipschitz Continuity of Solution Sets (chen2023bilevel)
Definition 4: Active Constraints
Definition 5: Linear Independence Constraint Qualification (LICQ)
Definition 6: Strict Complementarity
Example 1
Example 2
Definition 7
...and 32 more
