An Accelerated Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness

Xiaochuan Gong; Jie Hao; Mingrui Liu

TL;DR

This work tackles stochastic bilevel optimization where the upper-level objective is nonconvex with potentially unbounded smoothness and the lower-level objective is strongly convex. The authors introduce AccBO, which combines normalized stochastic gradient descent with recursive momentum for the upper level and stochastic Nesterov accelerated gradient descent (SNAG) with averaging for the lower level, under a distributional-drift framework. They prove a high-probability convergence result achieving an $ε$-stationary point in $\widetilde{O}(ε^{-3})$ oracle calls when the lower-level gradient variance is $O(ε)$, improving on the prior $\widetilde{O}(ε^{-4})$ rates. A key technical contribution is a novel high-probability analysis of SNAG under distribution drift, which also informs the hypergradient estimation error in bilevel optimization. Empirical results on deep AUC maximization and data hyper-cleaning corroborate the theoretical acceleration and show AccBO outperforms existing bilevel baselines.
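For orientation, "unbounded smoothness" is commonly formalized in the single-level literature as relaxed $(L_0, L_1)$-smoothness. The statement below is that standard single-level form, written with the assumed symbols $L_0$ and $L_1$; the bilevel assumption used in the paper may differ in its details.

```latex
\|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \,\|\nabla f(x)\| \qquad \text{for all } x .
```

Setting $L_1 = 0$ recovers ordinary $L_0$-smoothness, while $L_1 > 0$ lets the curvature bound grow linearly with the gradient norm.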

Abstract

This paper investigates a class of stochastic bilevel optimization problems where the upper-level function is nonconvex with potentially unbounded smoothness and the lower-level problem is strongly convex. These problems have significant applications in sequential data learning, such as text classification using recurrent neural networks. The unbounded smoothness is characterized by the smoothness constant of the upper-level function scaling linearly with the gradient norm, lacking a uniform upper bound. Existing state-of-the-art algorithms require $\widetilde{O}(1/ε^4)$ oracle calls of stochastic gradient or Hessian/Jacobian-vector product to find an $ε$-stationary point. However, it remains unclear if we can further improve the convergence rate when the assumptions for the function in the population level also hold for each random realization almost surely. To address this issue, we propose a new Accelerated Bilevel Optimization algorithm named AccBO. The algorithm updates the upper-level variable by normalized stochastic gradient descent with recursive momentum and the lower-level variable by the stochastic Nesterov accelerated gradient descent algorithm with averaging. We prove that our algorithm achieves an oracle complexity of $\widetilde{O}(1/ε^3)$ to find an $ε$-stationary point, when the lower-level stochastic gradient's variance is $O(ε)$. Our proof relies on a novel lemma characterizing the dynamics of stochastic Nesterov accelerated gradient descent algorithm under distribution drift with high probability for the lower-level variable, which is of independent interest and also plays a crucial role in analyzing the hypergradient estimation error over time. Experimental results on various tasks confirm that our proposed algorithm achieves the predicted theoretical acceleration and significantly outperforms baselines in bilevel optimization.
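The two update rules named in the abstract can be illustrated on toy problems. The following is a minimal sketch, not the authors' AccBO implementation: the function names, step sizes, toy objectives, and the plain running average are all assumptions made for illustration, and the two updates run separately on single-level problems with no hypergradient coupling.

```python
import numpy as np


def normalized_recursive_momentum_step(x, x_prev, m_prev, grad_fn, sample, eta, alpha):
    """Recursive-momentum gradient estimate followed by a normalized move."""
    g_new = grad_fn(x, sample)
    g_old = grad_fn(x_prev, sample)  # same sample, evaluated at the previous iterate
    m = g_new + (1.0 - alpha) * (m_prev - g_old)
    # The step length is eta no matter how large the estimate m is.
    x_next = x - eta * m / (np.linalg.norm(m) + 1e-12)
    return x_next, m


def snag_with_averaging(y0, grad_fn, samples, lr, momentum):
    """Stochastic Nesterov accelerated gradient; returns the running average of iterates."""
    y, z = y0.copy(), y0.copy()
    y_avg = np.zeros_like(y0)
    for k, s in enumerate(samples, start=1):
        y_next = z - lr * grad_fn(z, s)          # gradient step at the extrapolated point
        z = y_next + momentum * (y_next - y)     # Nesterov extrapolation
        y = y_next
        y_avg += (y - y_avg) / k                 # uniform running average (an assumption)
    return y_avg


def demo(seed=0):
    rng = np.random.default_rng(seed)

    # Toy stand-in for the upper level: f(x) = ||x||^4 / 4, whose curvature grows with ||x||.
    def upper_grad(x, noise):
        return np.dot(x, x) * x + noise

    eta, alpha = 0.05, 0.1
    x_prev = np.array([3.0, -2.0])
    m = upper_grad(x_prev, 0.1 * rng.standard_normal(2))
    x = x_prev - eta * m / np.linalg.norm(m)
    for _ in range(200):
        noise = 0.1 * rng.standard_normal(2)
        x_next, m = normalized_recursive_momentum_step(
            x, x_prev, m, upper_grad, noise, eta, alpha
        )
        x_prev, x = x, x_next

    # Toy stand-in for the lower level: strongly convex quadratic with mu = 1, L = 10.
    diag = np.array([1.0, 10.0])
    y_star = np.array([1.0, -1.0])

    def lower_grad(y, noise):
        return diag * (y - y_star) + noise

    kappa = 10.0
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    noises = 0.01 * rng.standard_normal((300, 2))
    y_avg = snag_with_averaging(np.zeros(2), lower_grad, noises, lr=0.1, momentum=beta)
    return x, y_avg, y_star


if __name__ == "__main__":
    x_final, y_avg, y_star = demo()
    print("upper-level iterate norm:", np.linalg.norm(x_final))
    print("lower-level averaged error:", np.linalg.norm(y_avg - y_star))
```

Normalizing the upper-level step keeps the move bounded even where the gradient is very large, which is the regime that relaxed smoothness permits. Reusing the same sample at the current and previous iterates is what makes the momentum estimate recursive.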

Paper Structure (38 sections, 29 theorems, 164 equations, 2 figures, 2 algorithms)

Introduction
Related Work
Problem Setup and Preliminaries
Algorithm and Analysis
Main Challenges and Algorithm Design
Main Results
Proof Sketch
Stochastic Nesterov Accelerated Gradient Descent under Distributional Drift
Application of Stochastic Nesterov Accelerated Gradient to Bilevel Optimization
Experiments
Conclusion
Technical Lemmas
Auxiliary Lemmas for Bilevel Optimization
Proofs of Results in \ref{sec:nesterov}
Bounding $(A)$.
...and 23 more sections

Key Result

Theorem 4.1

Suppose Assumptions \ref{ass:relax-smooth}, \ref{ass:f-and-g}, \ref{ass:noise}, and \ref{ass:individual-noise} hold. Let $\{x_t\}$ be the iterates produced by Algorithm \ref{alg:bilevel}. For any given $\delta\in(0,1)$ and small enough $\epsilon$ (see the exact choice in \eqref{eq:thm-eps}), if $\sigma_{g,1} = O(\sqrt{\epsilon})$ as defined in \eqref{eq:sigma-g1}, and the parameters are set as in \ref{para}, then with probability at least $1-2\delta$ over the randomness in $\sigma(\mathcal{F}^{\ldots})$ …

Figures (2)

Figure 1: Results of bilevel optimization on deep AUC maximization. Panels (a) and (b) show results over epochs; panels (c) and (d) show results over running time.
Figure 2: Results of bilevel optimization on data hyper-cleaning. Panels (a)-(d) use noise rate $p=0.1$: (a) and (b) show results over epochs, and (c) and (d) show results over running time. Panels (e)-(h) show the results with noise rate $p=0.2$.

Theorems & Definitions (45)

Theorem 4.1
Lemma 4.3
Lemma 4.4: Warm-start
Lemma 4.5: Option I
Lemma 4.6: Option II
Lemma 4.7: Averaging
Lemma 4.8
Lemma A.1: Recursive control on MGF
Proof of Lemma A.1 (Recursive control on MGF)
Lemma A.2: Young's inequality
...and 35 more
