On the Complexity of First-Order Methods in Stochastic Bilevel Optimization

Jeongyeol Kwon; Dohyun Kwon; Hanbaek Lyu

TL;DR

This work analyzes the fundamental complexity of finding stationary points in stochastic bilevel optimization with a strongly convex lower level, under a $y^*(x)$-aware oracle that provides an $O(\epsilon)$-accurate lower-level solution and locally unbiased gradients. It introduces a penalty-based approach that reduces the bilevel problem to a single-level surrogate $\mathcal{L}_{\lambda}^*(x)$ and leverages inner-outer loop schemes to control bias and variance, achieving $O(\epsilon^{-6})$ complexity without stochastic smoothness and $O(\epsilon^{-4})$ with it, for $\lambda=O(\epsilon^{-1})$. The paper also derives matching lower bounds $\Omega(\epsilon^{-6})$ and $\Omega(\epsilon^{-4})$ via probabilistic zero-chains, demonstrating that any algorithm playing against a $y^*(x)$-aware oracle cannot surpass these rates under the stated assumptions. By connecting the upper and lower bounds, the results reveal that first-order methods, under mild smoothness conditions, can match certain second-order benchmarks in bilevel settings and establish tight complexity barriers for this oracle model. The findings have implications for the design of efficient bilevel optimization algorithms in meta-learning, hyperparameter tuning, and adversarial learning, clarifying when a $y^*(x)$-aware oracle can yield substantial gains and when intrinsic difficulty limits progress.
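The bias of the penalty surrogate can be checked by hand on a toy problem. The sketch below is an illustration only, not the paper's algorithm: it assumes the usual penalty form $\mathcal{L}_{\lambda}(x,y) = f(x,y) + \lambda\,(g(x,y) - \min_z g(x,z))$ with $\mathcal{L}_{\lambda}^*(x) = \min_y \mathcal{L}_{\lambda}(x,y)$, which this summary does not spell out, and a hypothetical scalar instance $f(x,y) = \tfrac12 (y-1)^2 + \tfrac12 x^2$, $g(x,y) = \tfrac12 (y-x)^2$. On this instance the gap between $\nabla \mathcal{L}_{\lambda}^*(x)$ and the true hypergradient is $|x-1|/(1+\lambda)$, so choosing $\lambda \asymp \epsilon^{-1}$ keeps the bias at $O(\epsilon)$.

```python
# Toy instance (hypothetical, chosen so everything has a closed form):
#   f(x, y) = 0.5 * (y - 1)**2 + 0.5 * x**2   (upper level)
#   g(x, y) = 0.5 * (y - x)**2                (lower level, strongly convex in y)

def y_star(x):
    # The lower-level problem is minimized at y = x.
    return x

def hypergradient(x):
    # F(x) = f(x, y*(x)) = 0.5 * (x - 1)**2 + 0.5 * x**2, so F'(x) = 2x - 1.
    return 2.0 * x - 1.0

def y_lambda(x, lam):
    # Minimizer over y of f(x, y) + lam * g(x, y):
    # solves (y - 1) + lam * (y - x) = 0.
    return (1.0 + lam * x) / (1.0 + lam)

def penalty_gradient(x, lam):
    # Envelope-theorem gradient of the assumed surrogate L*_lam(x):
    # d/dx f(x, y_lam) + lam * (d/dx g(x, y_lam) - d/dx g(x, y*)).
    y_l, y_s = y_lambda(x, lam), y_star(x)
    grad_x_f = x
    grad_x_g_at_yl = x - y_l  # d/dx 0.5 * (y - x)**2 = x - y
    grad_x_g_at_ys = x - y_s  # equals 0 on this instance
    return grad_x_f + lam * (grad_x_g_at_yl - grad_x_g_at_ys)

def bias(x, lam):
    # Absolute gap between the surrogate gradient and the hypergradient.
    return abs(penalty_gradient(x, lam) - hypergradient(x))

if __name__ == "__main__":
    for lam in (1.0, 10.0, 100.0, 1000.0):
        print(lam, bias(0.3, lam))
```

Only first-order information about $f$ and $g$ enters `penalty_gradient`; no Hessian of $g$ is needed, which is the point of the penalty reduction.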

Abstract

We consider the problem of finding stationary points in bilevel optimization when the lower-level problem is unconstrained and strongly convex. The problem has been extensively studied in recent years; the main technical challenge is to keep track of the lower-level solutions $y^*(x)$ as the upper-level variables $x$ change. Consequently, all existing approaches tie their analyses to a genie algorithm that knows the lower-level solutions and therefore does not need to query any points far from them. We consider a dual question to such approaches: suppose we have an oracle, which we call $y^*$-aware, that returns an $O(\epsilon)$-estimate of the lower-level solution, in addition to first-order gradient estimators that are locally unbiased within the $\Theta(\epsilon)$-ball around $y^*(x)$. We study the complexity of finding stationary points with such a $y^*$-aware oracle: we propose a simple first-order method that converges to an $\epsilon$-stationary point using $O(\epsilon^{-6})$ and $O(\epsilon^{-4})$ calls to first-order $y^*$-aware oracles, without and with an additional smoothness assumption, respectively. Our upper bounds also apply to standard unbiased first-order oracles, improving the best-known complexity of first-order methods by a factor of $O(\epsilon)$ with minimal assumptions. We then provide matching $\Omega(\epsilon^{-6})$ and $\Omega(\epsilon^{-4})$ lower bounds without and with an additional smoothness assumption on $y^*$-aware oracles, respectively. Our results imply that any approach that simulates an algorithm with a $y^*$-aware oracle must suffer the same lower bounds.
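To make the oracle model concrete, here is a minimal sketch of what a $y^*$-aware oracle could look like on a toy lower-level problem. It is an assumption-laden illustration, not the paper's Definition 1.1: the function name `make_toy_oracle`, the quadratic $g(x,y) = \tfrac12 (y-x)^2$, the noise level `sigma`, and the constant offset used outside the ball are all hypothetical choices. The only properties it tries to mirror are the two stated in the abstract: the returned estimate is within $\epsilon$ of $y^*(x)$, and the gradient estimator is unbiased only for queries inside the $\epsilon$-ball around $y^*(x)$.

```python
import random

def make_toy_oracle(eps, sigma=0.1, seed=0):
    """Toy y*-aware oracle for g(x, y) = 0.5 * (y - x)**2, so y*(x) = x."""
    rng = random.Random(seed)

    def oracle(x, y):
        y_star = x
        # Estimate of the lower-level solution, accurate to within eps.
        y_hat = y_star + rng.uniform(-eps, eps)
        # Exact partial derivative of g in y, plus zero-mean Gaussian noise.
        grad_y = (y - x) + rng.gauss(0.0, sigma)
        # Outside the eps-ball around y*(x) no unbiasedness is promised;
        # an arbitrary constant offset makes that visible in this toy model.
        if abs(y - y_star) > eps:
            grad_y += 1.0
        return y_hat, grad_y

    return oracle

if __name__ == "__main__":
    oracle = make_toy_oracle(eps=0.1)
    x = 1.0
    inside = [oracle(x, x + 0.05)[1] for _ in range(20000)]
    outside = [oracle(x, x + 0.5)[1] for _ in range(20000)]
    print(sum(inside) / len(inside))    # close to the true value 0.05
    print(sum(outside) / len(outside))  # close to 1.5, not the true 0.5
```

An algorithm that only ever queries near the returned estimate sees unbiased gradients, while one that strays from it does not.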

Paper Structure (58 sections, 29 theorems, 149 equations, 1 algorithm)

Introduction
$y^*(x)$-Aware Oracles.
Prior Art.
Overview of Main Results
Upper Bound.
Lower Bound.
Our Approach
Upper Bound: Penalty Method
Lower Bound: Slower Progress
Related Work
Upper Bounds for Bilevel Optimization.
Lower Bounds for Bilevel Optimization.
Stochastic Nonconvex Optimization.
Preliminaries
Oracle Classes.
...and 43 more sections

Key Result

Theorem 3.1

Suppose the assumptions labeled nice_functions and hessian_lipschitz_g hold, and let $\lambda = \max\left( \frac{\lambda_0}{\epsilon}, \frac{6 l_{f,0}}{\mu_g r} \right) \asymp \epsilon^{-1}$ and $r_{\lambda} = \frac{l_{f,0}}{\mu_g \lambda}$, where $\lambda_0 := \frac{4 l_{f,0} l_{g,1}}{\mu_g^2}$ ...
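The parameter choices in the visible part of the statement are simple arithmetic. The helper below is a hedged sketch with a hypothetical name (`penalty_parameters`) and made-up constant values. It takes $\lambda_0$ as an input rather than computing it, because the statement is cut off after the definition of $\lambda_0$ and the full expression may contain further terms.

```python
def penalty_parameters(eps, l_f0, mu_g, r, lam0):
    """Return (lam, r_lam) as in the visible part of Theorem 3.1.

    lam   = max(lam0 / eps, 6 * l_f0 / (mu_g * r))
    r_lam = l_f0 / (mu_g * lam)
    """
    lam = max(lam0 / eps, 6.0 * l_f0 / (mu_g * r))
    r_lam = l_f0 / (mu_g * lam)
    return lam, r_lam

if __name__ == "__main__":
    # Illustrative constants only; none of these values come from the paper.
    l_f0, l_g1, mu_g, r = 1.0, 1.0, 1.0, 1.0
    # Visible fragment of the definition of lam0; the source is truncated here.
    lam0 = 4.0 * l_f0 * l_g1 / mu_g**2
    for eps in (1e-1, 1e-2, 1e-3):
        print(eps, penalty_parameters(eps, l_f0, mu_g, r, lam0))
```

For small $\epsilon$ the first term of the maximum dominates, so $\lambda$ grows like $\epsilon^{-1}$ and $r_{\lambda}$ shrinks like $\epsilon$, consistent with the $\Theta(\epsilon)$-ball in which the oracle's gradients are unbiased.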

Theorems & Definitions (33)

Definition 1.1: $y^*$-Aware Oracle
Theorem 3.1
Theorem 3.2
Lemma 3.3
Proposition 3.4
Proposition 3.5
Definition 4.1
Definition 4.2
Lemma 4.3
Lemma 4.4
...and 23 more
