A Nearly Optimal Single Loop Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness

Xiaochuan Gong, Jie Hao, Mingrui Liu

TL;DR

This work addresses stochastic bilevel optimization with a nonconvex upper-level objective that may exhibit unbounded smoothness and a strongly convex lower level. It proposes SLIP, a single-loop optimizer that performs a few SGD steps on the lower level before and during concurrent updates of the upper-level variable, using normalization and momentum to stabilize progress. The authors prove that SLIP converges to an $\epsilon$-stationary point in $\widetilde{O}(1/\epsilon^4)$ oracle calls, with both expectation and high-probability guarantees, and provide a novel connection between bilevel optimization and stochastic optimization under distributional drift. Empirical results on hyper-representation learning and data hyper-cleaning demonstrate substantial speedups over strong baselines, confirming the method’s practicality and impact for meta-learning and related tasks.
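For orientation, the setting summarized above is commonly written in the following form. This is the standard bilevel formulation; the symbol names ($\Phi$, $f$, $g$, $y^*$, $d_x$, $d_y$) are assumptions chosen here for illustration and are not taken from the summary itself.

```latex
\min_{x \in \mathbb{R}^{d_x}} \; \Phi(x) := f\bigl(x, y^*(x)\bigr),
\qquad
y^*(x) := \operatorname*{arg\,min}_{y \in \mathbb{R}^{d_y}} \; g(x, y)
```

Here $g(x,\cdot)$ is strongly convex for every $x$, so $y^*(x)$ is unique, while $\Phi$ may be nonconvex. An $\epsilon$-stationary point is typically understood as a point $x$ with $\|\nabla \Phi(x)\| \le \epsilon$; the paper's exact criterion may be stated in an averaged or probabilistic form.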

Abstract

This paper studies the problem of stochastic bilevel optimization where the upper-level function is nonconvex with potentially unbounded smoothness and the lower-level function is strongly convex. This problem is motivated by meta-learning applied to sequential data, such as text classification using recurrent neural networks, where the smoothness constant of the upper-level loss function scales linearly with the gradient norm and can be potentially unbounded. Existing algorithms crucially rely on a nested-loop design, which requires significant tuning effort and is impractical. In this paper, we address this issue by proposing a Single Loop bIlevel oPtimizer (SLIP). The proposed algorithm first updates the lower-level variable by a few steps of stochastic gradient descent, and then simultaneously updates the upper-level variable by normalized stochastic gradient descent with momentum and the lower-level variable by stochastic gradient descent. Under standard assumptions, we show that our algorithm finds an $\epsilon$-stationary point within $\widetilde{O}(1/\epsilon^4)$ oracle calls of stochastic gradient or Hessian-vector product, both in expectation and with high probability. (Here $\widetilde{O}(\cdot)$ hides logarithmic factors of $1/\epsilon$ and $1/\delta$, where $\delta\in(0,1)$ denotes the failure probability.) This complexity result is nearly optimal up to logarithmic factors without mean-square smoothness of the stochastic gradient oracle. Our proof relies on (i) a refined characterization and control of the lower-level variable and (ii) establishing a novel connection between bilevel optimization and stochastic optimization under distributional drift. Our experiments on various tasks show that our algorithm significantly outperforms strong baselines in bilevel optimization.
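The update pattern described in the abstract (a short lower-level warm start, then simultaneous normalized-momentum and SGD updates) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `slip_sketch`, all parameter names and default values, and the quadratic test problem are hypothetical. The lower-level objective is chosen so that its Hessian in $y$ is the identity, which makes the hypergradient available in closed form; the paper's stochastic Hessian-vector-product estimator is not reproduced here.

```python
import numpy as np

def slip_sketch(A, b, lam=0.1, warm_steps=20, T=1500,
                alpha_init=0.5, alpha=0.5, eta=0.01, beta=0.9,
                noise=0.01, seed=0):
    """Single-loop bilevel sketch on an assumed toy problem.

    Lower level: g(x, y) = 0.5 * ||y - A x||^2   (strongly convex in y)
    Upper level: f(x, y) = 0.5 * ||y - b||^2 + 0.5 * lam * ||x||^2
    """
    rng = np.random.default_rng(seed)
    d_y, d_x = A.shape
    x, y, m = np.zeros(d_x), np.zeros(d_y), np.zeros(d_x)

    def grad_y_g(x, y):
        # Noisy lower-level gradient; Gaussian noise stands in for sampling.
        return (y - A @ x) + noise * rng.standard_normal(d_y)

    def hypergrad(x, y):
        # Hypergradient estimate at the current y. For this toy problem the
        # y-Hessian of g is the identity and the cross derivative is -A^T,
        # so the estimate reduces to lam * x + A^T (y - b).
        return lam * x + A.T @ (y - b) + noise * rng.standard_normal(d_x)

    # Phase 1: warm-start the lower-level variable with a few SGD steps.
    for _ in range(warm_steps):
        y = y - alpha_init * grad_y_g(x, y)

    # Phase 2: single loop, updating both variables from the same (x_t, y_t).
    for _ in range(T):
        m = beta * m + (1.0 - beta) * hypergrad(x, y)   # momentum buffer
        g_y = grad_y_g(x, y)                            # evaluated at old x
        x = x - eta * m / (np.linalg.norm(m) + 1e-12)   # normalized step
        y = y - alpha * g_y                             # lower-level SGD step
    return x, y
```

Because the upper-level step is normalized, its length is always `eta` regardless of the gradient magnitude, which is the usual motivation for normalization when the smoothness constant can grow with the gradient norm. With a constant `eta`, the iterate settles into a small neighborhood of the stationary point rather than converging exactly.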

Paper Structure

This paper contains 44 sections, 32 theorems, 188 equations, 5 figures, 1 table, 2 algorithms.

Key Result

Theorem 4.1

Suppose Assumptions ass:relax-smooth and ass:f-and-g hold. Let $\{x_t\}$ be the iterates produced by Algorithm alg:bilevel. For any given $\delta\in(0,1)$ and sufficiently small $\epsilon$ (see the exact choice of $\epsilon$ in eq:eps), if we choose $\alpha^{\mathsf{init}}, \alpha, \beta, \gamma$ [...], where $\Delta_0$, $A$, and $B$ are defined in eq:ABdelta, then with probability at least $1-2\delta$ [...]

Figures (5)

  • Figure 1: Comparison with bilevel optimization baselines on hyper-representation. Panels (a) and (b) show results on the SNLI dataset; panels (c) and (d) show results on the Amazon Review Dataset (ARD).
  • Figure 2: Comparison with bilevel optimization baselines on data hyper-cleaning. Panels (a) and (b) show results with corruption rate $p=0.2$; panels (c) and (d) show results with corruption rate $p=0.4$.
  • Figure 3: Comparison of running time. (a) Hyper-representation on the SNLI dataset. (b) Hyper-representation on the Amazon Review Dataset (ARD). (c), (d) Data hyper-cleaning on Sentiment140 with corruption rates $p=0.2$ and $p=0.4$, respectively.
  • Figure 4: Comparison with bilevel optimization baselines on hyper-representation. Panels (a) and (b) show results on the SNLI dataset; panels (c) and (d) show results on the Amazon Review Dataset (ARD).
  • Figure 5: Comparison with bilevel optimization baselines on data hyper-cleaning. Panels (a) and (b) show results with corruption rate $p=0.2$; panels (c) and (d) show results with corruption rate $p=0.4$.

Theorems & Definitions (50)

  • Theorem 4.1
  • Theorem 4.3
  • Lemma 4.4
  • Lemma 4.5: Warm-start
  • Lemma 4.6
  • Lemma 4.7
  • Definition 2.1: zhang2020gradient
  • Definition 2.2: Remark 2.3 in zhang2020improved
  • Lemma 2.3: Lemma 6 in hao2024bilevel
  • Lemma 3.1: Hypergradient formula, Lemma 7 in hao2024bilevel
  • ...and 40 more