Achieving Near-Optimal Convergence for Distributed Minimax Optimization with Adaptive Stepsizes

Yan Huang, Xiang Li, Yipeng Shen, Niao He, Jinming Xu

TL;DR

This work tackles distributed nonconvex-strongly-concave minimax problems where adaptive stepsizes can fail to converge due to inter-node inconsistency. It introduces D-AdaST, a distributed adaptive minimax method with a two-scalar stepsize-tracking mechanism that enforces cross-node consistency and time-scale separation without requiring prior knowledge of problem parameters. The authors prove a near-optimal convergence rate of $\tilde{\mathcal{O}}(\epsilon^{-(4+\delta)})$ for any $\delta>0$ and characterize transient times for network independence, demonstrating parameter-agnostic performance. Empirical results on synthetic NC-SC tasks, robust neural network training, and GANs corroborate the theoretical claims, with D-AdaST consistently outperforming baselines. These findings position D-AdaST as the first distributed adaptive minimax method achieving near-centralized performance without problem-dependent parameter tuning.

Abstract

In this paper, we show that applying adaptive methods directly to distributed minimax problems can result in non-convergence due to inconsistency in locally computed adaptive stepsizes. To address this challenge, we propose D-AdaST, a Distributed Adaptive minimax method with Stepsize Tracking. The key strategy is to employ an adaptive stepsize tracking protocol involving the transmission of two extra (scalar) variables. This protocol ensures consistency among the stepsizes of the nodes, eliminating the steady-state error that commonly arises in vanilla distributed adaptive methods from uncoordinated stepsizes, and thus guarantees exact convergence. For nonconvex-strongly-concave distributed minimax problems, we characterize the specific transient times that ensure time-scale separation of stepsizes and quasi-independence of networks, leading to a near-optimal convergence rate of $\tilde{\mathcal{O}} \left( \epsilon^{-\left( 4+\delta \right)} \right)$ for any small $\delta > 0$, matching that of the centralized counterpart. To the best of our knowledge, D-AdaST is the first distributed adaptive method achieving near-optimal convergence without knowing any problem-dependent parameters for nonconvex minimax problems. Extensive experiments validate our theoretical results.
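The stepsize-inconsistency issue described in the abstract can be illustrated with a small sketch. It is not the D-AdaST recursion, which this summary does not spell out. It assumes each node keeps one AdaGrad-like scalar accumulator of squared gradient norms, a ring network, and uniform mixing weights of 1/3; the names `ring_mixing_matrix` and `accumulate` are hypothetical. The point is only that mixing scalar accumulators with a doubly stochastic matrix keeps their network average unchanged while preventing them from drifting apart across nodes.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix for an undirected ring (assumed topology)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] += 1.0 / 3.0            # weight on the node itself
        W[i, (i - 1) % n] += 1.0 / 3.0  # weight on the left neighbor
        W[i, (i + 1) % n] += 1.0 / 3.0  # weight on the right neighbor
    return W

def accumulate(local_sq_norms, W=None):
    """Accumulate per-node squared gradient norms.

    local_sq_norms has shape (K, n): K iterations, n nodes.
    With W=None each node accumulates on its own (no coordination).
    With a mixing matrix W, nodes also average accumulators with neighbors
    after every local update (an illustrative tracking step, not the
    paper's exact protocol).
    """
    K, n = local_sq_norms.shape
    v = np.zeros(n)
    for k in range(K):
        v = v + local_sq_norms[k]   # local AdaGrad-like accumulation
        if W is not None:
            v = W @ v               # one round of scalar communication
    return v

# Heterogeneous nodes: node i always sees squared gradient norm i + 1.
n, K = 5, 200
s = np.tile(np.arange(1, n + 1, dtype=float), (K, 1))
local = accumulate(s)
tracked = accumulate(s, ring_mixing_matrix(n))
print("spread without tracking:", local.max() - local.min())
print("spread with tracking:   ", tracked.max() - tracked.min())
```

Because the accumulators set the stepsizes, a spread that grows with the iteration count means nodes end up using different stepsizes, whereas the mixed accumulators stay within a bounded distance of their common average.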

Paper Structure (20 sections, 15 theorems, 113 equations, 9 figures, 2 algorithms)

Key Result

Theorem 1

There exists a distributed minimax problem in the form of Problem (Prob_minimax) and certain initialization such that after running D-TiAda with any $0 < \beta < 0.5 < \alpha<1$ and $\gamma_x, \gamma_y > 0$, it holds that for any $t = 0, 1, 2, \dots$, we have, where $\| \nabla_x f(x_0, y_0) \|$ and $\| \nabla_y f(x_0, y_0) \|$ can be arbitrarily large depending on the initialization.
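The roles of $\alpha$, $\beta$, $\gamma_x$ and $\gamma_y$ in the theorem can be made concrete with a single-node sketch. It assumes the TiAda-style rule in which the primal stepsize is $\gamma_x / \max(v_x, v_y)^{\alpha}$ and the dual stepsize is $\gamma_y / v_y^{\beta}$, with $v_x$ and $v_y$ the accumulated squared gradient norms; this summary does not state the rule, and the function name `tiada_style_stepsizes` is hypothetical.

```python
def tiada_style_stepsizes(v_x, v_y, gamma_x, gamma_y, alpha, beta):
    """Return (eta_x, eta_y) under an assumed TiAda-style rule.

    v_x, v_y: positive accumulators of squared gradient norms.
    alpha > beta makes the primal stepsize decay faster than the dual one.
    """
    eta_x = gamma_x / max(v_x, v_y) ** alpha  # primal stepsize sees both accumulators
    eta_y = gamma_y / v_y ** beta             # dual stepsize sees only its own
    return eta_x, eta_y

# With alpha = 0.6 > 0.5 > beta = 0.4, the ratio eta_x / eta_y shrinks
# as the accumulators grow, i.e. the primal variable moves on a slower time scale.
for v in (1.0, 1e2, 1e4):
    eta_x, eta_y = tiada_style_stepsizes(v, v, 0.1, 0.1, alpha=0.6, beta=0.4)
    print(v, eta_x / eta_y)
```

Under this assumed rule the ratio is bounded by $(\gamma_x/\gamma_y)\max(v_x, v_y)^{\beta-\alpha}$, so the separation emerges automatically once the accumulators are large, without tuning $\gamma_x$ or $\gamma_y$ to the problem.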

Figures (9)

  • Figure 1: Comparison among D-SGDA, D-TiAda and D-AdaST for the NC-SC quadratic objective function with $n=2$ nodes and $\gamma_x=\gamma_y$. Panel (a) shows the trajectories of the primal and dual variables of the algorithms; the points on the black dashed line are stationary points of $f$. Panel (b) shows the convergence of $\left\| \nabla_x f\left( x_k, y_k \right) \right\| ^2$ over the iterations. Panel (c) shows the convergence of the stepsize inconsistency $\zeta_{v}^{2}$ over the iterations. Notably, $\zeta_{v}^{2}$ fails to converge for D-TiAda, while $\zeta_{v}^{2}=0$ for non-adaptive D-SGDA.
  • Figure 2: Performance comparison of algorithms on quadratic functions over exponential graphs with node counts $n=\left\{ 50,100 \right\}$ and different initial stepsizes ($\gamma_y=0.1$).
  • Figure 3: Comparison of the algorithms on training a robust CNN on the MNIST dataset. The first row shows the results for the AdaGrad-like stepsize, and the second row for the Adam-like stepsize. The first three columns compare the algorithms on different graphs with $n=20$. The last column shows the scalability of D-AdaST in terms of the number of nodes. Initial stepsizes are set as $\gamma_x=0.01, \gamma_y=0.1$ for the AdaGrad-like stepsize, and $\gamma_x=0.1, \gamma_y=0.1$ for the Adam-like stepsize.
  • Figure 4: Training GANs on the CIFAR-10 dataset over exponential graphs with $n=10$ nodes.
  • Figure 5: Performance comparison of training a CNN on MNIST with $n=20$ nodes over directed ring and fully connected graphs.
  • ...and 4 more figures

Theorems & Definitions (34)

  • Theorem 1
  • Remark 1
  • Remark 2
  • Remark 3
  • Theorem 2
  • Remark 4: Near-optimal convergence
  • Remark 5: Parameter-agnostic property and transient times
  • Corollary 1
  • Lemma 1: Lemma A.2 in yang2022nest
  • Lemma 2
  • ...and 24 more