Fundamental Benefit of Alternating Updates in Minimax Optimization

Jaewook Lee; Hanseul Cho; Chulhee Yun

TL;DR

This work analyzes gradient descent-ascent (GDA) methods for minimax problems with strongly-convex-strongly-concave (SCSC), Lipschitz-gradient objectives. It provides global convergence guarantees showing that Alt-GDA is provably faster than Sim-GDA, with rates that depend on the interaction term through $\kappa_{xy}$, and introduces Alex-GDA, a general framework that subsumes Sim-GDA and Alt-GDA and matches the iteration complexity bound of the Extra-gradient method while requiring fewer gradient computations. For bilinear objectives, Alex-GDA converges linearly, with iteration complexity $\mathcal{O}\left(\left(\tfrac{L_{xy}}{\mu_{xy}}\right)^2\log(1/\epsilon)\right)$, whereas Sim-GDA and Alt-GDA fail to converge. The paper complements the theory with experiments on SCSC quadratic games and GAN training (WGAN-GP) that illustrate the speedups from alternating updates and extrapolation in practice.
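
To make the Sim-GDA versus Alt-GDA comparison concrete, the following is a minimal sketch on a one-dimensional SCSC quadratic, $f(x, y) = \tfrac{\mu}{2}x^2 + L_{xy}\,xy - \tfrac{\mu}{2}y^2$, whose saddle point is the origin. The objective, the constants `MU` and `L_XY`, the hand-picked step sizes, and all function names are assumptions made for illustration; they are not the paper's experimental setup or the step sizes from its theorems.

```python
import math

MU = 1.0     # strong convexity/concavity modulus (toy value, assumed)
L_XY = 10.0  # strength of the x-y interaction (toy value, assumed)


def grad_x(x, y):
    """Partial derivative of f(x, y) = MU/2*x^2 + L_XY*x*y - MU/2*y^2 in x."""
    return MU * x + L_XY * y


def grad_y(x, y):
    """Partial derivative of the same f in y."""
    return L_XY * x - MU * y


def sim_gda(x, y, step, num_steps):
    """Sim-GDA: both updates are computed from the same old iterate."""
    for _ in range(num_steps):
        x, y = x - step * grad_x(x, y), y + step * grad_y(x, y)
    return x, y


def alt_gda(x, y, step, num_steps):
    """Alt-GDA: the ascent step on y sees the freshly updated x."""
    for _ in range(num_steps):
        x = x - step * grad_x(x, y)
        y = y + step * grad_y(x, y)
    return x, y


# With equal step sizes, Sim-GDA on this toy contracts by
# sqrt((1 - a*MU)^2 + (a*L_XY)^2) per step, minimized at a = MU / (MU^2 + L_XY^2).
sim_step = MU / (MU**2 + L_XY**2)
# Hand-picked larger step; Sim-GDA would diverge with it on this toy.
alt_step = 0.1

sim_dist = math.hypot(*sim_gda(1.0, 1.0, sim_step, 200))
alt_dist = math.hypot(*alt_gda(1.0, 1.0, alt_step, 200))
print(f"Sim-GDA distance to saddle after 200 steps: {sim_dist:.3e}")
print(f"Alt-GDA distance to saddle after 200 steps: {alt_dist:.3e}")
```

On this toy problem the interaction term forces Sim-GDA to use a very small step, while Alt-GDA tolerates a step ten times larger and ends up much closer to the saddle point after the same number of iterations. A single instance like this only illustrates the phenomenon; the worst-case statement is what the paper's bounds address.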

Abstract

The Gradient Descent-Ascent (GDA) algorithm, designed to solve minimax optimization problems, takes the descent and ascent steps either simultaneously (Sim-GDA) or alternately (Alt-GDA). While Alt-GDA is commonly observed to converge faster, the performance gap between the two is not yet well understood theoretically, especially in terms of global convergence rates. To address this theory-practice gap, we present fine-grained convergence analyses of both algorithms for strongly-convex-strongly-concave and Lipschitz-gradient objectives. Our new iteration complexity upper bound of Alt-GDA is strictly smaller than the lower bound of Sim-GDA; i.e., Alt-GDA is provably faster. Moreover, we propose Alternating-Extrapolation GDA (Alex-GDA), a general algorithmic framework that subsumes Sim-GDA and Alt-GDA, whose main idea is to alternately take gradients from extrapolations of the iterates. We show that Alex-GDA satisfies a smaller iteration complexity bound, identical to that of the Extra-gradient method, while requiring fewer gradient computations. We also prove that Alex-GDA enjoys linear convergence for bilinear problems, for which both Sim-GDA and Alt-GDA fail to converge at all.
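
The failure of both plain GDA variants on bilinear problems can be seen on the simplest instance, $f(x, y) = xy$. The sketch below is an illustrative toy with assumed names and step sizes, not code from the paper: Sim-GDA spirals outward, while Alt-GDA stays on a bounded orbit around the saddle point without approaching it.

```python
import math


def sim_gda_step(x, y, grad_x, grad_y, alpha, beta):
    """One Sim-GDA step: both gradients are evaluated at the old (x, y)."""
    gx, gy = grad_x(x, y), grad_y(x, y)
    return x - alpha * gx, y + beta * gy


def alt_gda_step(x, y, grad_x, grad_y, alpha, beta):
    """One Alt-GDA step: the ascent step uses the already updated x."""
    x_new = x - alpha * grad_x(x, y)
    y_new = y + beta * grad_y(x_new, y)
    return x_new, y_new


def run(step_fn, x0, y0, grad_x, grad_y, alpha, beta, num_steps):
    """Apply step_fn repeatedly and return the final iterate."""
    x, y = x0, y0
    for _ in range(num_steps):
        x, y = step_fn(x, y, grad_x, grad_y, alpha, beta)
    return x, y


def bilinear_grad_x(x, y):
    """d/dx of the toy objective f(x, y) = x * y."""
    return y


def bilinear_grad_y(x, y):
    """d/dy of the toy objective f(x, y) = x * y."""
    return x


# Step size 0.1 and 1000 steps are arbitrary illustrative choices.
sim_final = run(sim_gda_step, 1.0, 1.0, bilinear_grad_x, bilinear_grad_y, 0.1, 0.1, 1000)
alt_final = run(alt_gda_step, 1.0, 1.0, bilinear_grad_x, bilinear_grad_y, 0.1, 0.1, 1000)

sim_norm = math.hypot(*sim_final)
alt_norm = math.hypot(*alt_final)
print(f"Sim-GDA distance to saddle: {sim_norm:.3e}")
print(f"Alt-GDA distance to saddle: {alt_norm:.3e}")
```

For this objective, each Sim-GDA step multiplies the squared distance to the origin by $1 + \alpha^2$, so the iterates diverge for every positive step size. With equal step sizes $\alpha$, each Alt-GDA step preserves the quantity $x^2 + y^2 - \alpha xy$, which keeps the iterates bounded but also prevents them from converging.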

Paper Structure (86 sections, 54 theorems, 460 equations, 2 figures, 2 tables, 3 algorithms)

Introduction
Summary of Contributions
Preliminaries
Function Class
Algorithms
Lyapunov Function
Convergence Analysis of Sim-GDA
Convergence Upper Bound
Convergence Lower Bound
Convergence Analysis of Alt-GDA
Convergence Upper Bound
Comparison with Sim-GDA.
Comparison with Local Analysis.
Alternating-Extrapolation GDA
Initialization.
...and 71 more sections

Key Result

Theorem 3.1

Suppose that $f \in {\mathcal{F}} (\mu_x, \mu_y, L_x, L_y, L_{xy})$. Then, there exists a pair of step sizes $\alpha, \beta$ with such that Sim-GDA satisfies $\Psi^{\text{Sim}}_{k+1} \le r \Psi^{\text{Sim}}_{k}$ with
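
As a generic worked step (a standard argument, not a restatement of the paper's specific constants), any contraction of the form $\Psi_{k+1} \le r \Psi_k$ with $0 < r < 1$ converts into an iteration count by unrolling the recursion:

$$
\Psi_K \le r^K \Psi_0 \le \epsilon
\quad\text{whenever}\quad
K \ge \frac{\log(\Psi_0/\epsilon)}{\log(1/r)} .
$$

Since $\log(1/r) \ge 1 - r$ for $r \in (0, 1)$, it suffices to take

$$
K \ge \frac{1}{1-r}\,\log\frac{\Psi_0}{\epsilon},
$$

so a contraction factor closer to $1$ translates directly into a larger iteration complexity.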

Figures (2)

Figure 1: (Top) Comparing the convergence speeds of algorithms: Sim-GDA, Alt-GDA, EG, OGD and Alex-GDA. (Bottom) Trajectory of the algorithms. (Partial visualization. Originally, the trajectory is $6$-dimensional since $d_x=d_y=3$).
Figure 2: Guessing the complexity bound of Sim-/Alt-GDA. Left: log-log plot between $\kappa = L/\mu$ and the near-optimal worst-case complexity. Right: Slope of the log-log plot. Each point corresponds to the slope of a line segment connecting a pair of adjacent points in the left plot.

Theorems & Definitions (96)

Definition 2.1: Strong convexity/concavity
Definition 2.2: Strong-convex-strong-concavity
Definition 2.3: Lipschitz gradients
Definition 2.4: Condition numbers
Definition 2.5: Function class
Definition 2.6
Definition 2.7: Lyapunov function
Definition 2.8
Theorem 3.1
Corollary 3.1
...and 86 more
