Two-Timescale Gradient Descent Ascent Algorithms for Nonconvex Minimax Optimization

Tianyi Lin, Chi Jin, Michael I. Jordan

TL;DR

The paper introduces two-timescale gradient descent ascent (TTGDA) and its stochastic variant (TTSGDA) for solving the nonconvex minimax problem $\min_{\mathbf x} \max_{\mathbf y \in \mathcal{Y}} f(\mathbf x,\mathbf y)$ with a convex bounded $\mathcal{Y}$ and $f$ nonconvex in $\mathbf x$ and concave in $\mathbf y$. It provides a unified nonasymptotic analysis in both smooth and nonsmooth settings, covering nonconvex-strongly-concave and nonconvex-concave regimes, and develops a novel proof technique based on slowly changing concave objectives to establish descent and convergence guarantees. The results yield explicit gradient-complexity bounds for both deterministic and stochastic variants, with distinct rates reflecting the problem structure (e.g., $\Theta(\kappa^2)$ step-size asymmetry and dependence on the Moreau envelope in the nonsmooth case). Theoretical findings are complemented by applications to robust logistic regression and Wasserstein GANs, demonstrating practical advantages over single-loop or vanilla GDA baselines and highlighting the potential of TTGDA/TTSGDA in training GANs and robust learning tasks.
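For orientation, the two notions named above can be written out as follows. This is a sketch in standard notation rather than a quotation from the paper: the step-size symbols $\eta_{\mathbf x}, \eta_{\mathbf y}$, the projection $\mathcal{P}_{\mathcal{Y}}$, and the envelope parameter $\lambda$ are assumptions introduced here.

```latex
% Two-timescale update: a small descent step in x, a larger projected ascent step in y
\mathbf x_{t+1} = \mathbf x_t - \eta_{\mathbf x} \nabla_{\mathbf x} f(\mathbf x_t, \mathbf y_t),
\qquad
\mathbf y_{t+1} = \mathcal{P}_{\mathcal{Y}}\bigl(\mathbf y_t + \eta_{\mathbf y} \nabla_{\mathbf y} f(\mathbf x_t, \mathbf y_t)\bigr),
\qquad
\eta_{\mathbf x} \ll \eta_{\mathbf y}.

% Moreau envelope of \Phi with parameter \lambda > 0; when \Phi is not differentiable,
% a small \|\nabla \Phi_\lambda(\mathbf x)\| serves as the stationarity measure
\Phi_\lambda(\mathbf x) = \min_{\mathbf w} \Bigl\{ \Phi(\mathbf w) + \tfrac{1}{2\lambda} \|\mathbf w - \mathbf x\|^2 \Bigr\}.
```

For an $\ell$-weakly convex $\Phi$, the envelope is well defined and differentiable whenever $\lambda < 1/\ell$.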

Abstract

We provide a unified analysis of two-timescale gradient descent ascent (TTGDA) for solving structured nonconvex minimax optimization problems of the form $\min_{\mathbf x} \max_{\mathbf y \in \mathcal{Y}} f(\mathbf x, \mathbf y)$, where the objective function $f(\mathbf x, \mathbf y)$ is nonconvex in $\mathbf x$ and concave in $\mathbf y$, and the constraint set $\mathcal{Y} \subseteq \mathbb{R}^n$ is convex and bounded. In the convex-concave setting, the single-timescale gradient descent ascent (GDA) algorithm is widely used in applications and has been shown to have strong convergence guarantees. In more general settings, however, it can fail to converge. Our contribution is to design TTGDA algorithms that are effective beyond the convex-concave setting, efficiently finding a stationary point of the function $\Phi(\cdot) := \max_{\mathbf y \in \mathcal{Y}} f(\cdot, \mathbf y)$. We also establish theoretical bounds on the complexity of solving both smooth and nonsmooth nonconvex-concave minimax optimization problems. To the best of our knowledge, this is the first systematic analysis of TTGDA for nonconvex minimax optimization, shedding light on its superior performance in training generative adversarial networks (GANs) and in other real-world application problems.
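To make the two-timescale idea concrete, the following is a minimal Python sketch, not the authors' implementation. Each iteration takes a small gradient descent step in $\mathbf x$ and a larger projected gradient ascent step in $\mathbf y$, both evaluated at the current iterate. The toy objective $f(x, y) = y \sin x - y^2/2$ on $\mathcal{Y} = [-1, 1]$, the function names, and the step-size values are all illustrative assumptions. For this toy problem $\Phi(x) = \tfrac{1}{2}\sin^2 x$, which is nonconvex in $x$.

```python
import math


def project_interval(y, lo=-1.0, hi=1.0):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(max(y, lo), hi)


def ttgda(grad_x, grad_y, project, x0, y0, eta_x, eta_y, num_iters):
    """Two-timescale GDA sketch: slow descent in x, fast projected ascent in y.

    Both gradients are evaluated at the current pair (x, y) before either
    variable is updated, so the two steps are simultaneous.
    """
    x, y = x0, y0
    for _ in range(num_iters):
        gx = grad_x(x, y)
        gy = grad_y(x, y)
        x = x - eta_x * gx           # small step on the min variable
        y = project(y + eta_y * gy)  # larger step on the max variable
    return x, y


# Hypothetical toy problem: f(x, y) = y * sin(x) - y**2 / 2 with Y = [-1, 1].
def toy_grad_x(x, y):
    return y * math.cos(x)


def toy_grad_y(x, y):
    return math.sin(x) - y


def toy_grad_phi(x):
    # Phi(x) = max_y f(x, y) = sin(x)**2 / 2, attained at y = sin(x).
    return math.sin(x) * math.cos(x)


# Step sizes are illustrative; eta_x is deliberately much smaller than eta_y.
x_final, y_final = ttgda(
    toy_grad_x, toy_grad_y, project_interval,
    x0=1.0, y0=0.0, eta_x=0.03, eta_y=0.5, num_iters=5000,
)
print("x:", x_final, "y:", y_final, "|grad Phi(x)|:", abs(toy_grad_phi(x_final)))
```

Because $y$ moves on the faster timescale, it stays close to the maximizer $\sin x$ while $x$ drifts slowly, so the $x$-update approximately follows the gradient of $\Phi$. That is the intuition behind the unequal step sizes.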

Paper Structure (37 sections, 23 theorems, 191 equations, 3 figures, 1 table)

Introduction
Notation.
Related Work
Nonconvex-concave setting.
Nonconvex-nonconcave setting.
Other settings.
Preliminaries and Technical Background
Smooth Minimax Optimization
Main results
Discussions
Proof sketch
Nonconvex-strongly-concave setting.
Nonconvex-concave setting.
Nonsmooth Minimax Optimization
Main results
...and 22 more sections

Key Result

Lemma 3.6

If $f(\mathbf x, \mathbf y)$ is $\ell$-smooth and concave in $\mathbf y$, and $\mathcal{Y}$ is convex and bounded, then $\Phi$ is $\ell$-weakly convex and the following statements hold true,

Figures (3)

Figure 1: Performance of all the algorithms on six LIBSVM datasets. Results are plotted against epoch count; the evaluation metric is the gradient norm of the function $\Phi(\cdot) = \max_{\mathbf y \in \mathcal{Y}} f(\cdot, \mathbf y)$.
Figure 2: Performance of all the algorithms for training WGANs with linear generators. Results are plotted against iteration count; the evaluation metric is the gradient norm of the function $f(\cdot, \cdot)$.
Figure 3: Performance of all the algorithms for training WGANs with nonlinear generators. Results are plotted against iteration count; the evaluation metric is the gradient norm of the function $f(\cdot, \cdot)$.

Theorems & Definitions (47)

Definition 3.1
Definition 3.2
Definition 3.3
Definition 3.4
Definition 3.5
Lemma 3.6
Lemma 3.7
Definition 3.8
Lemma 3.9
Definition 3.10
...and 37 more
