Adaptive Algorithms with Sharp Convergence Rates for Stochastic Hierarchical Optimization

Xiaochuan Gong, Jie Hao, Mingrui Liu

TL;DR

The paper tackles stochastic hierarchical optimization (minimax and bilevel) under unknown gradient-noise levels. It introduces Ada-Minimax and Ada-BiO, which combine momentum normalization with online adaptive momentum and stepsizes to automatically adapt to noise, achieving sharp gradient-norm rates $\widetilde{O}(1/\sqrt{T} + \sqrt{\bar{\sigma}}/T^{1/4})$ in the single-level setting and extending to hierarchical problems. Theoretical results show high-probability convergence bounds, with Ada-Minimax attaining $\widetilde{O}(1/\sqrt{T} + \sqrt{\bar{\sigma}_x}/T^{1/4})$ and Ada-BiO achieving $\widetilde{O}((4\bar{\sigma}_{\phi}^2 + \sigma_{g,1}^2)^{1/4}/T^{1/4})$, while tests on synthetic data and deep learning tasks corroborate adaptivity and robustness. The work provides the first adaptive guarantees for stochastic hierarchical optimization without prior noise knowledge, offering practical improvements for minimax and bilevel learning scenarios in noise-varied environments.
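To make the two-term rate concrete, the short calculation below works out when each term dominates. It uses only the single-level rate quoted above and ignores the logarithmic factors hidden by $\widetilde{O}$. The target accuracy $\epsilon$ is a symbol introduced here for illustration and does not come from the paper.

```latex
% Crossover between the two terms of the rate
\[
\frac{1}{\sqrt{T}} \;\ge\; \frac{\sqrt{\bar{\sigma}}}{T^{1/4}}
\quad\Longleftrightarrow\quad
\sqrt{\bar{\sigma}} \;\le\; T^{-1/4}
\quad\Longleftrightarrow\quad
\bar{\sigma} \;\le\; T^{-1/2}.
\]
% Iterations needed to drive each term below a target accuracy \epsilon
\[
\frac{1}{\sqrt{T}} \le \epsilon \;\Longleftrightarrow\; T \ge \epsilon^{-2},
\qquad
\frac{\sqrt{\bar{\sigma}}}{T^{1/4}} \le \epsilon \;\Longleftrightarrow\; T \ge \bar{\sigma}^{2}\,\epsilon^{-4}.
\]
```

When $\bar{\sigma} \le T^{-1/2}$ the $1/\sqrt{T}$ term dominates, and otherwise the $\sqrt{\bar{\sigma}}/T^{1/4}$ term does. An adaptive method moves between the two regimes without being told $\bar{\sigma}$.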

Abstract

Hierarchical optimization refers to problems with interdependent decision variables and objectives, such as minimax and bilevel formulations. While various algorithms have been proposed, existing methods and analyses lack adaptivity in stochastic optimization settings: they cannot achieve optimal convergence rates across a wide spectrum of gradient noise levels without prior knowledge of the noise magnitude. In this paper, we propose novel adaptive algorithms for two important classes of stochastic hierarchical optimization problems: nonconvex-strongly-concave minimax optimization and nonconvex-strongly-convex bilevel optimization. Our algorithms achieve sharp convergence rates of $\widetilde{O}(1/\sqrt{T} + \sqrt{\bar{\sigma}}/T^{1/4})$ in $T$ iterations for the gradient norm, where $\bar{\sigma}$ is an upper bound on the stochastic gradient noise. Notably, these rates are obtained without prior knowledge of the noise level, thereby enabling automatic adaptivity in both low- and high-noise regimes. To our knowledge, this work provides the first adaptive and sharp convergence guarantees for stochastic hierarchical optimization. Our algorithm design combines the momentum normalization technique with novel adaptive parameter choices. Extensive experiments on synthetic and deep learning tasks demonstrate the effectiveness of our proposed algorithms.
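As a rough illustration of the momentum normalization idea mentioned in the abstract, the sketch below runs a generic normalized-momentum update on a toy noisy quadratic. It is not Ada-Minimax or Ada-BiO: the momentum parameter `beta` and stepsize `eta` are fixed constants here, whereas the paper chooses them adaptively, and the function names, the toy objective, and all numeric values are assumptions made for this example.

```python
import numpy as np

def normalized_momentum_sgd(grad_fn, x0, steps=2000, eta=0.01, beta=0.9, seed=0):
    """Generic SGD with a normalized momentum direction (fixed eta and beta).

    Illustrative only: the paper's algorithms pick the momentum and stepsize
    adaptively, which this sketch does not attempt to reproduce.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x, rng)               # stochastic gradient estimate
        m = beta * m + (1.0 - beta) * g   # exponential moving average of gradients
        norm = np.linalg.norm(m)
        if norm > 0.0:
            x = x - eta * m / norm        # step of length eta along -m
    return x

def noisy_quadratic_grad(x, rng, sigma=0.1):
    """Gradient of the toy objective 0.5 * ||x||^2 plus Gaussian noise (assumed)."""
    return x + sigma * rng.standard_normal(x.shape)

x_final = normalized_momentum_sgd(noisy_quadratic_grad, x0=[3.0, -4.0])
print(np.linalg.norm(x_final))
```

Because every update has length exactly `eta` regardless of the gradient scale, the iterate cannot settle closer to the minimizer than roughly the stepsize allows. This is one reason the choice of stepsize and momentum matters when the noise level is unknown.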

Paper Structure

This paper contains 37 sections, 35 theorems, 179 equations, 5 figures, 2 tables, 3 algorithms.

Key Result

Theorem 4.1

Under the assumptions ass:smoothness_minimax and ass:noise_minimax and the parameter choices in eq:alpha_minimax and eq:eta_minimax, let $\bar{\sigma}_x=\sigma_y$. Then for any $\delta\in(0, 1/7)$, it holds with probability at least $1-7\delta$ that the gradient norm is bounded by $\widetilde{O}(1/\sqrt{T} + \sqrt{\bar{\sigma}_x}/T^{1/4})$, where the constants $C_{m}=\widetilde{O}(\kappa_{\sigma}^4)$ and $D$ appearing in the bound are defined in eq:C_minimax and eq:D, respectively.

Figures (5)

  • Figure 1: Synthetic experiments on a 1-dimensional function for minimax optimization.
  • Figure 2: 2-layer Transformer for deep AUC maximization on imbalanced Sentiment140 dataset.
  • Figure 3: Robustness of hyperparameters.
  • Figure 4: Comparison of BERT model on hyperparameter optimization.
  • Figure : Adaptive Algorithm for Minimax Optimization (Ada-Minimax)

Theorems & Definitions (56)

  • Theorem 4.1
  • Theorem 4.2
  • Lemma 5.2
  • Lemma 5.2
  • Lemma 5.2
  • Theorem 5.3
  • Lemma 5.3
  • Lemma A.1: carmon2022making
  • Lemma A.2: liu2023near
  • Proof of lem:MDS
  • ...and 46 more