TiAda: A Time-scale Adaptive Algorithm for Nonconvex Minimax Optimization

Xiang Li, Junchi Yang, Niao He

TL;DR

This work proposes TiAda, a single-loop adaptive GDA algorithm for nonconvex minimax optimization that automatically adapts to the time-scale separation and achieves near-optimal complexities simultaneously in the deterministic and stochastic settings of nonconvex-strongly-concave minimax problems.

Abstract

Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and the knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications.
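
The time-scale adaptation described above can be illustrated with a minimal sketch. It assumes AdaGrad-style accumulators of squared gradient norms, a primal stepsize normalized by the larger of the two accumulators raised to $\alpha$, and a dual stepsize normalized by its own accumulator raised to $\beta$, with $0 < \beta < \alpha < 1$. The exact update rule, the default values, the function name `tiada`, and the toy objective are assumptions for illustration; none of them is stated in the abstract.

```python
def tiada(grad_x, grad_y, x0, y0, eta_x, eta_y,
          alpha=0.6, beta=0.4, v0=1.0, steps=500):
    """TiAda-style single-loop adaptive GDA on scalar variables (a sketch).

    Assumed update: both accumulators sum squared gradients; the primal
    stepsize uses max(vx, vy)**alpha, the dual stepsize uses vy**beta.
    """
    if not 0.0 < beta < alpha < 1.0:
        raise ValueError("need 0 < beta < alpha < 1")
    x, y = float(x0), float(y0)
    vx = vy = float(v0)  # initial accumulators (assumed positive)
    history = []
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        vx += gx * gx
        vy += gy * gy
        # Coupling the primal stepsize to the dual accumulator lets the
        # ratio step_x / step_y shrink while the dual gradients are large.
        step_x = eta_x / max(vx, vy) ** alpha
        step_y = eta_y / vy ** beta
        # Simultaneous descent on x and ascent on y.
        x, y = x - step_x * gx, y + step_y * gy
        history.append((x, y, step_x, step_y))
    return history


# Hypothetical toy objective (not taken from the paper):
#   f(x, y) = L*x*y - 0.5*y**2, which is strongly concave in y.
L = 2.0


def grad_x(x, y):
    return L * y


def grad_y(x, y):
    return L * x - y


# Initial stepsize ratio eta_x / eta_y = 5, mirroring the "poor ratio" setup.
history = tiada(grad_x, grad_y, x0=1.0, y0=0.0, eta_x=0.5, eta_y=0.1)
x, y, step_x, step_y = history[-1]
print(f"x={x:.3e}, y={y:.3e}, effective stepsize ratio={step_x / step_y:.3f}")
```

Because the primal normalizer is never smaller than the dual accumulator and $\alpha > \beta$, the effective ratio `step_x / step_y` in this sketch cannot exceed the initial ratio once the accumulators are at least 1, which is the qualitative behavior the abstract attributes to automatic time-scale separation.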

Paper Structure

This paper contains 30 sections, 13 theorems, 29 equations, 7 figures, 1 table, and 2 algorithms.

Key Result

Theorem 3.1

Under Assumptions \ref{assume:strong-convex}, \ref{assume:smoothness}, and \ref{assume:interior_optimal}, Algorithm \ref{alg:tiada} (TiAda) with deterministic gradient oracles satisfies that for any $0 < \beta < \alpha < 1$, after $T$ iterations,

Figures (7)

  • Figure 1: Comparison between TiAda and vanilla GDA with AdaGrad stepsizes (labeled as AdaGrad) on the quadratic function \ref{eq:quad} with $L=2$ under a poor initial stepsize ratio, i.e., $\eta^x / \eta^y = 5$. Here, $\eta^x_t$ and $\eta^y_t$ are the effective stepsizes for $x$ and $y$, respectively, and $\kappa$ is the condition number. (a) shows the trajectories of the two algorithms; the background color indicates the function value $f(x, y)$. In (b), while the effective stepsize ratio stays unchanged for AdaGrad, TiAda adapts to the desired time-scale separation $1/\kappa$, which divides the training process into two stages. In (c), after entering Stage II, TiAda converges fast, whereas AdaGrad diverges.
  • Figure 2: Comparison of algorithms on test functions. $r=\eta^x/\eta^y$ is the initial stepsize ratio. In the first row, we use the quadratic function \ref{eq:quad} with $L=2$ under deterministic gradient oracles. For the second row, we test the methods on the McCormick function with noisy gradients.
  • Figure 3: Comparison of the algorithms on distributional robustness optimization \ref{eq:dist_robust}. We use $i$ in the legend to indicate the number of inner loops. We present two sets of stepsize configurations, one for comparing AdaGrad-like algorithms and one for comparing Adam-like algorithms. See \ref{sec:add_exp} for extensive experiments on larger ranges of stepsizes, where TiAda is the best among all stepsize combinations in our grid.
  • Figure 4: Inception score on WGAN-GP.
  • Figure 5: Illustration of the effect of $\alpha$ and $\beta$ on the two stages in TiAda's time-scale adaptation process. We set $\beta=1 - \alpha$. The dashed line on the right plot represents the first iteration when the effective stepsize ratio is below $1/\kappa$.
  • ...and 2 more figures

Theorems & Definitions (18)

  • Remark 3.1
  • Theorem 3.1: deterministic setting
  • Remark 3.2
  • Remark 3.3
  • Theorem 3.2: stochastic setting
  • Remark 3.4
  • Lemma B.1: Lemma A.2 in yang2022nest
  • Lemma B.2: smoothness of $\Phi(\cdot)$ and Lipschitzness of $y^*(\cdot)$. Lemma 4.3 in lin2020gradient
  • Lemma B.3: smoothness of $y^*(\cdot)$. Lemma 2 in chen2021closing
  • Theorem C.1: deterministic setting
  • ...and 8 more