Omega: Optimistic EMA Gradients

Juan Ramirez, Rohan Sukumaran, Quentin Bertrand, Gauthier Gidel

TL;DR

The paper tackles the instability of stochastic min-max optimization under gradient noise and the high cost of the more robust alternatives. It introduces Omega, an optimistic-like algorithm that incorporates an exponential moving average (EMA) of past gradients into the update, preserving a one-gradient-per-update cost while reducing variance. A momentum variant, OmegaM, is also explored. Experiments across stochastic bilinear, quadratic, and quadratic-linear games show that Omega often outperforms the independent samples stochastic optimistic gradient method (ISOG) in bilinear settings and remains competitive elsewhere. The work points to future directions in convergence analysis and in applying Omega to practical tasks such as GAN training.

Abstract

Stochastic min-max optimization has gained interest in the machine learning community with the advancements in GANs and adversarial training. Although game optimization is fairly well understood in the deterministic setting, some issues persist in the stochastic regime. Recent work has shown that stochastic gradient descent-ascent methods such as the optimistic gradient are highly sensitive to noise or can fail to converge. Although alternative strategies exist, they can be prohibitively expensive. We introduce Omega, a method with optimistic-like updates that mitigates the impact of noise by incorporating an EMA of historic gradients in its update rule. We also explore a variation of this algorithm that incorporates momentum. Although we do not provide convergence guarantees, our experiments on stochastic games show that Omega outperforms the optimistic gradient method when applied to linear players.
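
To make the idea of an optimistic-like update with an EMA of past gradients concrete, here is a minimal sketch in plain Python. It rests on an assumption: this summary does not state Omega's update rule, so the sketch takes the familiar optimistic gradient step and replaces the previous gradient in its correction term with an EMA of past gradients. The toy game f(x, y) = x * y and the names `bilinear_field` and `omega_like_step` are hypothetical, chosen only for illustration, and the paper's equations remain the authoritative definition.

```python
def bilinear_field(x, y):
    """Vector field (df/dx, -df/dy) of the toy game f(x, y) = x * y.

    Descending along this field minimizes over x and maximizes over y.
    """
    return (y, -x)


def omega_like_step(point, ema, lr, beta):
    """One optimistic-like step using an EMA of past gradients.

    ASSUMED form (not taken from the summary above):
        new_point = point - lr * g - lr * (g - ema)
        new_ema   = beta * ema + (1 - beta) * g
    where g is the single gradient evaluated at `point`.
    """
    g = bilinear_field(*point)
    # Gradient step plus an optimistic correction against the EMA.
    new_point = tuple(p - lr * (2.0 * gi - mi)
                      for p, gi, mi in zip(point, g, ema))
    # Refresh the EMA with the gradient that was just used.
    new_ema = tuple(beta * mi + (1.0 - beta) * gi
                    for gi, mi in zip(g, ema))
    return new_point, new_ema


if __name__ == "__main__":
    point, ema = (1.0, 1.0), (0.0, 0.0)
    for _ in range(5):
        point, ema = omega_like_step(point, ema, lr=0.1, beta=0.9)
    print(point, ema)
```

Under this assumed form, each step evaluates the gradient only once, in line with the one-gradient-per-update cost mentioned in the TL;DR. Setting `beta=0` makes the EMA equal to the previous gradient, so the step reduces to the standard optimistic gradient update.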

Paper Structure

This paper contains 21 sections, 21 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Iterates of stochastic gradient descent-ascent (SGD), the independent samples stochastic optimistic gradient method (ISOG) and Omega for a 2D stochastic bilinear game. $\kappa$ indicates the conditioning of the optimization problem. SGD diverges, while ISOG and Omega converge. Omega converges faster than ISOG.
  • Figure 2: Distance to optimum when solving a 100-dimensional stochastic bilinear game with SGD, ISOG, and Omega ($\beta=0.9$). Omega reaches the optimum faster than ISOG, while SGD diverges.
  • Figure 3: Training dynamics of Omega when solving a 100-dimensional bilinear game for different choices of the EMA decay hyper-parameter $\beta$, with a learning rate of $0.02$. A decay of $\beta=0.99$ produces larger oscillations than the other values, and $\beta=0.9$ appeared to be the best choice for this setup.
  • Figure 4: Distance to optimum for SGD, ISOG, Omega, SGDM, and OmegaM when solving a 100-dimensional stochastic quadratic game. For Omega, SGDM, and OmegaM, we choose $\beta=0.9$. Given the nature of quadratic games, all algorithms converge with similar behavior.
  • Figure 5: Distance to optimum when solving a 100-dimensional stochastic quadratic-linear game. SGD is used for the quadratic player, and SGD, ISOG, and Omega for the linear player. Omega makes progress toward the optimum faster than the other methods.