On the Convergence of Adam-Type Algorithm for Bilevel Optimization under Unbounded Smoothness
Xiaochuan Gong, Jie Hao, Mingrui Liu
TL;DR
This work addresses bilevel optimization under unbounded smoothness by introducing AdamBO, a single-loop Adam-type method that updates the upper-level variable with Adam while warm-starting and updating the lower-level variable via SGD. The authors establish that AdamBO achieves $\widetilde{O}(ε^{-4})$ oracle complexity to reach an $ε$-stationary point, despite non-negligible hypergradient bias arising from lower-level estimation, by proving a novel randomness decoupling lemma that separates upper- and lower-level randomness. The analysis is complemented by a refined treatment of hypergradient bias and a stopping-time framework, which together enable high-probability convergence guarantees. Empirical results on meta-learning with hyper-representations and deep AUC maximization with RNNs/Transformers demonstrate faster convergence and substantial performance gains over existing bilevel baselines, underscoring the practical impact of adaptive, Adam-style updates in bilevel contexts. The work also discusses the dependence on the parameters $\beta$ and $\lambda$, pointing to future directions to reduce this dependence while preserving strong empirical performance.
Abstract
Adam has become one of the most popular optimizers for training modern deep neural networks, such as transformers. However, its applicability is largely restricted to single-level optimization problems. In this paper, we aim to extend vanilla Adam to tackle bilevel optimization problems, which have important applications in machine learning, such as meta-learning. In particular, we study stochastic bilevel optimization problems where the lower-level function is strongly convex and the upper-level objective is nonconvex with potentially unbounded smoothness. This unbounded smooth objective function covers a broad class of neural networks, including transformers, which may exhibit non-Lipschitz gradients. In this work, we introduce AdamBO, a single-loop Adam-type method that achieves $\widetilde{O}(ε^{-4})$ oracle complexity to find $ε$-stationary points, where the oracle calls involve stochastic gradient or Hessian/Jacobian-vector product evaluations. The key to our analysis is a novel randomness decoupling lemma that provides refined control over the lower-level variable. We conduct extensive experiments on various machine learning tasks involving bilevel formulations with recurrent neural networks (RNNs) and transformers, demonstrating the effectiveness of our proposed Adam-type algorithm.
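The single-loop structure described above (lower-level SGD with a warm start, upper-level Adam driven by a hypergradient estimate) can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `adambo_sketch`, the quadratic objectives, the Gaussian noise model, and all hyperparameter values are hypothetical choices made for illustration. The lower-level objective is $g(x, y) = \frac{1}{2}\|y - Ax\|^2$, which is strongly convex in $y$ with $y^*(x) = Ax$, and the upper-level objective is $f(x, y) = \frac{1}{2}\|y - b\|^2$. For this pair the hypergradient has the closed form $A^\top (y^*(x) - b)$, so the sketch simply plugs in the current $y$ in place of $y^*(x)$; the abstract does not specify how the hypergradient is estimated in general. The toy upper-level objective is also convex with bounded smoothness, so the example shows only the update structure and not the unbounded-smoothness setting that the analysis targets.

```python
import numpy as np


def adambo_sketch(A, b, steps=2000, warm_start_steps=50, eta=0.01, gamma=0.5,
                  beta1=0.9, beta2=0.999, lam=1e-8, noise=0.01, seed=0):
    """Single-loop bilevel sketch: SGD on y (lower level), Adam on x (upper level).

    Toy problem (an assumption of this sketch, not taken from the paper):
        lower level: g(x, y) = 0.5 * ||y - A x||^2   (so y*(x) = A x)
        upper level: f(x, y) = 0.5 * ||y - b||^2
    """
    rng = np.random.default_rng(seed)
    m, d = A.shape
    x = np.zeros(d)
    y = np.zeros(m)

    # Warm start: run SGD on the lower-level problem with x held fixed.
    for _ in range(warm_start_steps):
        grad_y = (y - A @ x) + noise * rng.standard_normal(m)
        y = y - gamma * grad_y

    mom = np.zeros(d)  # Adam first-moment estimate
    var = np.zeros(d)  # Adam second-moment estimate
    for t in range(1, steps + 1):
        # Lower level: one noisy SGD step on g(x, .) per iteration.
        grad_y = (y - A @ x) + noise * rng.standard_normal(m)
        y = y - gamma * grad_y

        # Hypergradient estimate. It is biased whenever y != y*(x), because
        # the current y is used in place of the exact lower-level solution.
        hyper = A.T @ (y - b) + noise * rng.standard_normal(d)

        # Upper level: standard Adam update with bias correction.
        mom = beta1 * mom + (1.0 - beta1) * hyper
        var = beta2 * var + (1.0 - beta2) * hyper ** 2
        mom_hat = mom / (1.0 - beta1 ** t)
        var_hat = var / (1.0 - beta2 ** t)
        x = x - eta * mom_hat / (np.sqrt(var_hat) + lam)

    return x, y


# Usage: the minimizer of 0.5 * ||A x - b||^2 for this A and b is x = [1, -1].
A_demo = np.array([[2.0, 0.0], [0.0, 1.0]])
b_demo = np.array([2.0, -1.0])
x_demo, y_demo = adambo_sketch(A_demo, b_demo)
print("x:", x_demo, "lower-level residual:", np.linalg.norm(y_demo - A_demo @ x_demo))
```

The lower-level step size `gamma` is deliberately much larger than the upper-level step size `eta` in this sketch, so that $y$ tracks $y^*(x)$ closely while $x$ moves slowly; this keeps the hypergradient bias small on the toy problem and is a design choice of the example rather than a setting taken from the paper.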
