Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis
Jie Hao, Xiaochuan Gong, Mingrui Liu
TL;DR
This work tackles bilevel optimization when the upper-level objective exhibits unbounded smoothness, a setting in which existing methods, which assume a Lipschitz upper-level gradient, do not apply. It introduces BO-REP, which combines normalized momentum for the upper-level updates with two lower-level techniques, initialization refinement and periodic updates, to control hypergradient bias. When the lower-level problem is strongly convex, the authors prove that BO-REP finds an $\epsilon$-stationary point in $\widetilde{\mathcal{O}}(1/\epsilon^4)$ stochastic oracle calls, matching the best known results for the bounded-smoothness setting up to logarithmic factors. Experiments on hyper-representation learning, hyperparameter optimization, and data hyper-cleaning for text classification support the method's practical effectiveness.
Abstract
Bilevel optimization is an important formulation for many machine learning problems. Current bilevel optimization algorithms assume that the gradient of the upper-level function is Lipschitz, that is, that the upper-level function has a bounded smoothness parameter. However, recent studies reveal that certain neural networks, such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), can exhibit unbounded smoothness, rendering conventional bilevel optimization algorithms unsuitable. In this paper, we design a new bilevel optimization algorithm, namely BO-REP, to address this challenge. The algorithm updates the upper-level variable using normalized momentum and incorporates two novel techniques for updating the lower-level variable: \textit{initialization refinement} and \textit{periodic updates}. Specifically, once the upper-level variable is initialized, a subroutine is invoked to obtain a refined estimate of the corresponding optimal lower-level variable, and the lower-level variable is then updated only once every fixed period rather than at every iteration. When the upper-level problem is nonconvex with unbounded smoothness and the lower-level problem is strongly convex, we prove that our algorithm requires $\widetilde{\mathcal{O}}(1/\epsilon^4)$ iterations to find an $\epsilon$-stationary point in the stochastic setting, where each iteration involves calling a stochastic gradient or Hessian-vector product oracle. Notably, this result matches the state-of-the-art complexity results under the bounded smoothness setting without mean-squared smoothness of the stochastic gradient, up to logarithmic factors. Our proof relies on novel technical lemmas for the periodically updated lower-level variable, which are of independent interest. Our experiments on hyper-representation learning, hyperparameter optimization, and data hyper-cleaning for text classification tasks demonstrate the effectiveness of our proposed algorithm.
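To make the three ingredients named above concrete (normalized momentum on the upper level, initialization refinement, and periodic lower-level updates), the following is a minimal illustrative sketch on a toy quadratic bilevel problem. It is not the authors' BO-REP implementation. Everything specific in it is an assumption made for illustration: the toy objectives $f(x, y) = \tfrac{1}{2}\|y - b\|^2$ and $g(x, y) = \tfrac{1}{2}\|y - Ax\|^2$, the matrix `A`, the vector `b`, the Gaussian noise model, the plain SGD lower-level subroutine, and all hyperparameter values and function names. Because the lower-level Hessian is the identity in this toy problem, the hypergradient estimate has the closed form $A^\top (y - b)$, so no Hessian-vector product oracle is needed here.

```python
import numpy as np


def normalized_momentum_bilevel_sketch(
    num_iters=3000,    # upper-level iterations (assumed value)
    period=5,          # refresh the lower-level variable every `period` iterations
    inner_steps=5,     # SGD steps per periodic refresh
    warmup_steps=200,  # SGD steps for the initial refinement of y
    eta=0.01,          # upper-level step size
    beta=0.9,          # momentum parameter
    gamma=0.5,         # lower-level step size
    noise=0.01,        # std of the simulated stochastic-oracle noise
    seed=0,
):
    """Toy sketch; returns (initial, final) norm of the exact hypergradient."""
    rng = np.random.default_rng(seed)

    # Hypothetical problem data: g(x, y) = 0.5 * ||y - A x||^2 is strongly
    # convex in y with minimizer y*(x) = A x; f(x, y) = 0.5 * ||y - b||^2.
    A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    b = np.array([1.0, -1.0, 0.5])

    def exact_hypergrad(x):
        # Gradient of Phi(x) = f(x, y*(x)) = 0.5 * ||A x - b||^2.
        return A.T @ (A @ x - b)

    def lower_sgd(x, y, steps):
        # Noisy gradient descent on g(x, .), standing in for the lower-level solver.
        for _ in range(steps):
            grad_y = (y - A @ x) + noise * rng.standard_normal(y.shape)
            y = y - gamma * grad_y
        return y

    def hypergrad_estimate(x, y):
        # Plug the current y into the hypergradient formula, plus oracle noise.
        return A.T @ (y - b) + noise * rng.standard_normal(x.shape)

    x = np.array([5.0, 5.0])
    initial_norm = np.linalg.norm(exact_hypergrad(x))

    # Initialization refinement: get an accurate estimate of y*(x_0) up front.
    y = lower_sgd(x, np.zeros(3), warmup_steps)

    m = np.zeros_like(x)
    for t in range(num_iters):
        # Periodic update: y is refreshed only once every `period` iterations.
        if t > 0 and t % period == 0:
            y = lower_sgd(x, y, inner_steps)

        # Normalized momentum: the step length is eta regardless of the
        # gradient magnitude, so x drifts slowly between refreshes of y.
        m = beta * m + (1.0 - beta) * hypergrad_estimate(x, y)
        x = x - eta * m / (np.linalg.norm(m) + 1e-12)

    final_norm = np.linalg.norm(exact_hypergrad(x))
    return initial_norm, final_norm


if __name__ == "__main__":
    init_norm, final_norm = normalized_momentum_bilevel_sketch()
    print(f"hypergradient norm: initial={init_norm:.3f}, final={final_norm:.3f}")
```

The design point the sketch is meant to illustrate is that normalization bounds how far the upper-level variable can move per iteration, so a lower-level estimate that is refined once and then refreshed only periodically can still stay close to the moving optimum $y^*(x)$.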
