Stable Nonconvex-Nonconcave Training via Linear Interpolation

Thomas Pethick; Wanyun Xie; Volkan Cevher

TL;DR

The paper introduces relaxed approximate proximal point (RAPP), a new optimization scheme that is the first explicit method without anchoring to achieve last-iterate convergence rates for $\rho$-comonotone problems while requiring only $\rho > -\tfrac{1}{2L}$.

Abstract

This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training. We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators. We construct a new optimization scheme called relaxed approximate proximal point (RAPP), which is the first explicit method without anchoring to achieve last-iterate convergence rates for $\rho$-comonotone problems while only requiring $\rho > -\tfrac{1}{2L}$. The construction extends to constrained and regularized settings. By replacing the inner optimizer in RAPP, we rediscover the family of Lookahead algorithms, for which we establish convergence in cohypomonotone problems even when the base optimizer is taken to be gradient descent ascent. The range of cohypomonotone problems in which Lookahead converges is further expanded by exploiting the fact that Lookahead inherits the properties of the base optimizer. We corroborate the results with experiments on generative adversarial networks, which demonstrate the benefits of the linear interpolation present in both RAPP and Lookahead.
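
The role of linear interpolation described in the abstract can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes the bilinear game $\phi(x, y) = xy$, whose gradient operator is $Fz = (y, -x)$, and the names `gda_step`, `lookahead_gda`, `gamma`, `lam`, and `tau` are illustrative choices. It compares plain gradient descent ascent (GDA) with a Lookahead-style scheme that runs `tau` GDA steps and then interpolates back toward the starting point with weight `lam`.

```python
import math

def F(z):
    """Gradient operator F z = (grad_x phi, -grad_y phi) for the assumed phi(x, y) = x * y."""
    x, y = z
    return (y, -x)

def gda_step(z, gamma):
    """One gradient descent ascent step: z - gamma * F(z)."""
    g = F(z)
    return (z[0] - gamma * g[0], z[1] - gamma * g[1])

def lookahead_gda(z, gamma, lam, tau, outer_steps):
    """Lookahead-style loop with GDA as base optimizer (hypothetical parameter names)."""
    for _ in range(outer_steps):
        w = z
        for _ in range(tau):
            # tau inner GDA steps, started from the current slow iterate z
            w = gda_step(w, gamma)
        # linear interpolation between the slow iterate z and the fast iterate w
        z = ((1 - lam) * z[0] + lam * w[0], (1 - lam) * z[1] + lam * w[1])
    return z

def norm(z):
    """Euclidean distance to the solution z* = (0, 0)."""
    return math.hypot(z[0], z[1])

z0 = (1.0, 1.0)
gamma, lam, tau = 0.5, 0.3, 2  # illustrative values, not the paper's settings

z_gda = z0
for _ in range(200):
    z_gda = gda_step(z_gda, gamma)

z_la = lookahead_gda(z0, gamma, lam, tau, outer_steps=200)

print("plain GDA distance to solution:", norm(z_gda))
print("Lookahead distance to solution:", norm(z_la))
```

On this particular example and for these parameter values, the plain GDA iterates spiral away from the solution while the interpolated iterates contract toward it. The behaviour depends on `gamma`, `lam`, and `tau`, so the sketch should be read as an illustration of the mechanism rather than as evidence for the paper's rates.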

Paper Structure (38 sections, 3 theorems, 95 equations, 8 figures, 4 tables, 1 algorithm)

This paper contains 38 sections, 3 theorems, 95 equations, 8 figures, 4 tables, 1 algorithm.

Additional related work
Stochastic feedback
Halpern-type
Preliminaries
Relationship between weak Minty variational inequalities and cohypomonotonicity
Introduction
Related work
Lookahead
Cohypomonotone
Proximal point
Setup
Proofs for \ref{sec:ikm} (Inexact Krasnosel'skiĭ-Mann iterations)
Inexact Krasnosel'skiĭ-Mann iterations
Generalizing to conic nonexpansiveness
Proofs for \ref{sec:onestep} (Approximating the resolvent)
...and 23 more sections

Key Result

Lemma 1

Suppose the Lipschitz assumption on $F$ (ass:F:Lips) holds and $\gamma \leq 1/L$. Then, the mapping $H = \operatorname{id} - \gamma F$ is $1/2$-cocoercive for all $u \in \mathbb{R}^d$. Specifically,
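
The displayed inequality of the lemma is not reproduced on this page. As a hedged sketch of why the claim is plausible, assume the common convention that a mapping $A$ is $\beta$-cocoercive when $\langle Au - Av, u - v\rangle \geq \beta \|Au - Av\|^2$ for all $u, v$, and that $F$ is $L$-Lipschitz. Writing $H$ for the mapping $\operatorname{id} - \gamma F$, expanding both terms gives

$$
\langle Hu - Hv, u - v\rangle - \tfrac{1}{2}\|Hu - Hv\|^2
= \tfrac{1}{2}\|u - v\|^2 - \tfrac{\gamma^2}{2}\|Fu - Fv\|^2
\geq \tfrac{1}{2}\left(1 - \gamma^2 L^2\right)\|u - v\|^2 \geq 0,
$$

where the first inequality uses $\|Fu - Fv\| \leq L\|u - v\|$ and the second uses $\gamma \leq 1/L$. Under the assumed convention this is exactly $1/2$-cocoercivity of $H$; the paper's own statement and proof may be phrased differently.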

Figures (8)

Figure 1: Overview of results and relationship between methods.
Figure 2: Consider $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \phi(z)$ with $z=(x,y)$. As opposed to convex-concave minimax problems, the cohypomonotone condition allows the gradients $Fz = (\nabla_x \phi(z), -\nabla_y \phi(z))$ to point away from the solutions (see \ref{app:weakMVI-comonotone} for the relationship between cohypomonotonicity and the weak MVI). This can lead to instability issues for standard algorithms such as the Adam optimizer.
Figure 3: \ref{eq:lookahead} and \ref{alg:inexactResolvent} can converge on the example from hsieh2021limits. Interestingly, the stepsize $\gamma$ can be set larger than $1/L$ while \ref{alg:inexactResolvent} remains stable. Approximate proximal point (APP) with the same stepsize diverges (the iterates of APP are deferred to \ref{fig:APPM:iterates}). In this example, the rates make it apparent that there is a benefit in replacing the conservative inner update in \ref{alg:inexactResolvent} with GDA in \ref{eq:lookahead}, as explored in \ref{sec:lookahead}.
Figure 4: We test the Lookahead variants on the example from pethick2022escaping where $\rho \in (-1/8L,-1/10L)$ (left) and on the example from pethick2022escaping with $\rho = -1/3$ (right). For the left example, \ref{eq:lookahead} (provably) converges for $\tau=2$, but may be nonconvergent for larger $\tau$, as illustrated. Both variants of \ref{eq:lookahead} diverge in the more difficult example on the right, while \ref{eq:LA-CEG+}, in contrast, provably converges. It seems that \ref{eq:LA-CEG+} trades off a constant slowdown in the rate for convergence in a larger class.
Figure 5: The iterates of APP associated with \ref{fig:forsaken}.
...and 3 more figures
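
The captions above refer to approximate proximal point (APP) updates and to an inner update that approximates the resolvent. The sketch below shows one standard explicit way to do this; it is an assumption and may not match the paper's algorithm exactly. It approximates the proximal point $w = z - \gamma F(w)$ by the fixed-point iteration $w \leftarrow z - \gamma F(w)$, then applies a relaxed (interpolated) outer step. The operator $Fz = (y, -x)$ and the names `approx_resolvent`, `relaxed_app`, `gamma`, `lam`, and `tau` are illustrative.

```python
import math

def F(z):
    """Illustrative 1-Lipschitz operator F z = (y, -x), i.e. phi(x, y) = x * y."""
    x, y = z
    return (y, -x)

def exact_resolvent(z, gamma):
    """Closed-form solution w of w + gamma * F(w) = z for this linear F."""
    x, y = z
    d = 1.0 + gamma * gamma
    return ((x - gamma * y) / d, (gamma * x + y) / d)

def approx_resolvent(z, gamma, tau):
    """Fixed-point iteration w <- z - gamma * F(w), started at w = z.

    For an L-Lipschitz F this map is a contraction when gamma * L < 1.
    """
    w = z
    for _ in range(tau):
        g = F(w)
        w = (z[0] - gamma * g[0], z[1] - gamma * g[1])
    return w

def relaxed_app(z, gamma, lam, tau, outer_steps):
    """Relaxed outer loop: interpolate between z and the approximate resolvent."""
    for _ in range(outer_steps):
        w = approx_resolvent(z, gamma, tau)
        z = ((1 - lam) * z[0] + lam * w[0], (1 - lam) * z[1] + lam * w[1])
    return z

def dist(a, b):
    """Euclidean distance between two points in the plane."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

z0 = (1.0, 1.0)
gamma = 0.5  # illustrative stepsize with gamma * L = 0.5 < 1

# How close the inner loop gets to the exact proximal point after 20 iterations
gap = dist(approx_resolvent(z0, gamma, tau=20), exact_resolvent(z0, gamma))
print("inner-loop gap to exact resolvent:", gap)

# Relaxed outer iteration with only a few inner steps
z_final = relaxed_app(z0, gamma, lam=0.5, tau=3, outer_steps=100)
print("distance to solution after 100 outer steps:", dist(z_final, (0.0, 0.0)))
```

The design choice illustrated here is that the inner loop never solves the implicit proximal equation exactly; it only needs evaluations of $F$, which is what makes the scheme explicit.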

Theorems & Definitions (29)

Definition 1
Definition 2: (co)monotonicity (bauschke2021generalized)
Definition 3: Lipschitz continuity and cocoercivity
Lemma 1 (pethick2022escaping)
Proof
Example 4
Remark 6
Definition 8
Remark 9
Remark 10
...and 19 more
