
Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis

Jie Hao, Xiaochuan Gong, Mingrui Liu

TL;DR

This work tackles bilevel optimization when the upper-level objective exhibits unbounded smoothness, a setting where existing methods struggle. It introduces BO-REP, which combines normalized momentum for stable upper-level updates with initialization refinement and periodic lower-level updates to tightly control hypergradient bias. The authors prove a $\widetilde{O}(1/\epsilon^4)$ stochastic iteration complexity to obtain an $\epsilon$-stationary point when the lower level is strongly convex, matching bounded-smoothness results up to log factors, and validate the approach on hyper-representation learning, hyperparameter optimization, and data hyper-cleaning tasks. The results indicate that BO-REP provides both theoretical guarantees and practical gains for challenging bilevel problems in text-classification contexts.

Abstract

Bilevel optimization is an important formulation for many machine learning problems. Current bilevel optimization algorithms assume that the gradient of the upper-level function is Lipschitz. However, recent studies reveal that certain neural networks such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) exhibit potential unbounded smoothness, rendering conventional bilevel optimization algorithms unsuitable. In this paper, we design a new bilevel optimization algorithm, namely BO-REP, to address this challenge. This algorithm updates the upper-level variable using normalized momentum and incorporates two novel techniques for updating the lower-level variable: initialization refinement and periodic updates. Specifically, once the upper-level variable is initialized, a subroutine is invoked to obtain a refined estimate of the corresponding optimal lower-level variable, and the lower-level variable is updated only after every specific period instead of each iteration. When the upper-level problem is nonconvex and unbounded smooth, and the lower-level problem is strongly convex, we prove that our algorithm requires $\widetilde{\mathcal{O}}(1/\epsilon^4)$ iterations to find an $\epsilon$-stationary point in the stochastic setting, where each iteration involves calling a stochastic gradient or Hessian-vector product oracle. Notably, this result matches the state-of-the-art complexity results under the bounded smoothness setting and without mean-squared smoothness of the stochastic gradient, up to logarithmic factors. Our proof relies on novel technical lemmas for the periodically updated lower-level variable, which are of independent interest. Our experiments on hyper-representation learning, hyperparameter optimization, and data hyper-cleaning for text classification tasks demonstrate the effectiveness of our proposed algorithm.
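The three ingredients named in the abstract (normalized momentum for the upper-level variable, initialization refinement, and periodic lower-level updates) can be illustrated on a small quadratic problem. The sketch below is not the authors' implementation: the toy objectives, the matrices `A` and `b`, the function names, and every hyperparameter value are assumptions chosen for illustration, and plain gradient descent stands in for whatever lower-level subroutine the paper uses. The noise is off by default, so the run is deterministic.

```python
import numpy as np

# Hypothetical toy problem (not from the paper):
#   lower level: g(x, y) = 0.5 * ||y - A x||^2   (strongly convex in y, y*(x) = A x)
#   upper level: f(x, y) = 0.5 * ||y - b||^2 + 0.5 * lam * ||x||^2
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, -1.0, 0.5])
lam = 0.1


def lower_level_steps(x, y, num_steps, step=0.5):
    """Run gradient descent on g(x, .) starting from y."""
    for _ in range(num_steps):
        y = y - step * (y - A @ x)  # grad_y g(x, y) = y - A x
    return y


def hypergradient_estimate(x, y):
    """Hypergradient formula evaluated at an approximate lower-level solution y.

    For this toy problem the y-Hessian of g is the identity and the mixed
    second derivative is -A^T, so the general formula reduces to
    lam * x + A^T (y - b).
    """
    return lam * x + A.T @ (y - b)


def grad_phi_norm(x):
    """Norm of the exact hypergradient, using y*(x) = A x."""
    return np.linalg.norm(hypergradient_estimate(x, A @ x))


def bo_rep_sketch(x0=(2.0, 2.0), num_iters=3000, eta=0.002, beta=0.9,
                  period=10, inner_steps=10, refine_steps=50,
                  noise_std=0.0, seed=0):
    """Illustrative loop; all default values are assumptions, not the paper's."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    # Initialization refinement: estimate y*(x_0) accurately once, up front.
    y = lower_level_steps(x, np.zeros(A.shape[0]), refine_steps)
    m = np.zeros_like(x)
    for k in range(num_iters):
        # Periodic update: refresh y only every `period` iterations.
        if k > 0 and k % period == 0:
            y = lower_level_steps(x, y, inner_steps)
        noise = noise_std * rng.standard_normal(x.shape)
        h = hypergradient_estimate(x, y) + noise
        m = beta * m + (1.0 - beta) * h                 # moving-average estimator
        x = x - eta * m / (np.linalg.norm(m) + 1e-12)   # normalized momentum step
    return x, y
```

Because each upper-level step has length at most `eta`, the lower-level estimate can only become stale by a bounded amount between refreshes, which is the intuition behind updating `y` periodically rather than at every iteration.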

Paper Structure

This paper contains 11 sections, 6 theorems, 8 equations, 1 figure, 1 table, 3 algorithms.

Key Result

theorem 1

Suppose Assumptions ass:relaxedsmooth, ass:f_g_property and ass:stochastic hold. Run Algorithm alg:blue for $K$ iterations and let $\{\mathbf{x}_k\}_{k\geq0}$ be the sequence produced by Algorithm alg:blue. For $\epsilon \leq \min\left(\frac{K_0}{K_1},\sqrt{\frac{\sigma_{f,1}^2 + \frac{2M^2}{\mu^2}\sigma_{

Figures (1)

  • Figure 1: Comparison of various bilevel optimization algorithms on three applications: (a) hyper-representation learning on the Amazon Review dataset; (b) hyperparameter optimization on the Amazon Review dataset; (c) data hyper-cleaning on the Sentiment140 dataset with noise rate $p=0.3$.

Theorems & Definitions (7)

  • definition 1: $\epsilon$-stationary points
  • theorem 1
  • lemma 1: Initialization Refinement
  • lemma 2: Periodic Updates
  • lemma 3: Error Control for the Lower-level Problem
  • lemma 4: Bias of the Hypergradient Estimator
  • lemma 5: Expected Error of the Moving-Average Hypergradient Estimator
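For orientation, the objects named in the list above can be written in the standard bilevel notation. The symbols $f$, $g$, $\Phi$, and $y^*$ are assumed here, since this page does not reproduce the paper's own notation, and the stationarity condition is given in its usual deterministic form (stochastic analyses typically bound an expectation of the same quantity).

```latex
% Bilevel problem with a strongly convex lower level (assumed notation)
\min_{x} \; \Phi(x) := f\bigl(x, y^*(x)\bigr),
\qquad
y^*(x) = \operatorname*{arg\,min}_{y} \; g(x, y).

% Hypergradient via the implicit function theorem
\nabla \Phi(x)
  = \nabla_x f\bigl(x, y^*(x)\bigr)
  - \nabla^2_{xy} g\bigl(x, y^*(x)\bigr)
    \bigl[\nabla^2_{yy} g\bigl(x, y^*(x)\bigr)\bigr]^{-1}
    \nabla_y f\bigl(x, y^*(x)\bigr).

% epsilon-stationary point
\|\nabla \Phi(x)\| \le \epsilon .
```

In practice $y^*(x)$ is replaced by an approximate lower-level iterate, so the resulting estimator is biased, which is the quantity the lemma on the bias of the hypergradient estimator refers to.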