SDEs for Minimax Optimization

Enea Monzio Compagnoni, Antonio Orvieto, Hans Kersting, Frank Norbert Proske, Aurelien Lucchi

TL;DR

The paper develops a formal stochastic differential equation (SDE) framework for analyzing minimax optimizers, deriving SDE models for SGDA, SEG, and SHGD as weak approximations of their discrete updates. It shows how hyperparameters, such as the SEG extra stepsize ρ, interact with gradient noise and landscape curvature to produce implicit regularization and curvature-induced diffusion, which enables a unified Itô-calculus-based analysis of convergence and dynamics. The study identifies regimes where SEG behaves like SGDA and shows that the curvature-aware SHGD introduces explicit curvature-driven noise; exact dynamics for quadratic games illustrate a trade-off between convergence speed and asymptotic accuracy. Empirical validation shows that the SDEs capture the trajectories and variance of the optimizers across landscapes, and the work provides concrete convergence conditions and scheduler designs that ensure convergence. Overall, the framework offers a principled, analyzable lens to compare minimax optimizers and informs design choices for robust stochastic optimization in complex games.

Abstract

Minimax optimization problems have attracted a lot of attention over the past few years, with applications ranging from economics to machine learning. While advanced optimization methods exist for such problems, characterizing their dynamics in stochastic scenarios remains notably challenging. In this paper, we pioneer the use of stochastic differential equations (SDEs) to analyze and compare Minimax optimizers. Our SDE models for Stochastic Gradient Descent-Ascent, Stochastic Extragradient, and Stochastic Hamiltonian Gradient Descent are provable approximations of their algorithmic counterparts, clearly showcasing the interplay between hyperparameters, implicit regularization, and implicit curvature-induced noise. This perspective also allows for a unified and simplified analysis strategy based on the principles of Itô calculus. Finally, our approach facilitates the derivation of convergence conditions and closed-form solutions for the dynamics in simplified settings, unveiling further insights into the behavior of different optimizers.

Paper Structure (72 sections, 54 theorems, 321 equations, 9 figures, 1 algorithm)

Key Result

Theorem 3.3

Under sufficient regularity conditions, the solution of the following SDE is an order $1$ weak approximation of the discrete update of SGDA (\ref{eq:SGDA_Discr_Update}):

$$dZ_t = -F\left(Z_t\right) dt + \sqrt{\eta}\, \sqrt{\Sigma\left(Z_t\right)}\, dW_t,$$

where $\eta$ is the stepsize, $\Sigma(z)$ is the noise covariance, and $\xi_{\gamma}(z):= F \left(z\right) - F_{\gamma}\left(z\right)$ is the noise in the sample $F_\gamma$.
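As a rough illustration of what "weak approximation" means here, the sketch below compares the second moment of SGDA iterates with that of an Euler-Maruyama discretization of an SDE. It assumes the SDE takes the standard first-order form $dZ_t = -F(Z_t)dt + \sqrt{\eta}\sqrt{\Sigma(Z_t)}dW_t$. The quadratic game $f(x,y) = \frac{\mu}{2}x^2 + xy - \frac{\mu}{2}y^2$, the additive Gaussian noise with $\Sigma = \sigma^2 I$, and every name and default value in the code are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def simulate(eta=0.01, mu=0.5, sigma=1.0, steps=500, substeps=10,
             runs=20000, seed=0):
    """Return (E||z_K||^2 for SGDA, E||Z_{K*eta}||^2 for the SDE).

    Hypothetical setup: F(z) = A z for f(x, y) = mu/2 x^2 + x y - mu/2 y^2,
    and F_gamma(z) = F(z) - xi with xi ~ N(0, sigma^2 I), so Sigma = sigma^2 I.
    """
    rng = np.random.default_rng(seed)
    A = np.array([[mu, 1.0], [-1.0, mu]])   # F(z) = (df/dx, -df/dy) = A z
    z = np.tile([1.0, 1.0], (runs, 1))      # SGDA iterates, one row per run
    Z = z.copy()                            # SDE samples, same initial point
    dt = eta / substeps                     # finer grid for Euler-Maruyama

    for _ in range(steps):
        # SGDA: z_{k+1} = z_k - eta * F_gamma(z_k)
        xi = sigma * rng.standard_normal((runs, 2))
        z = z - eta * (z @ A.T - xi)

        # Advance the SDE by one unit of "algorithmic time" eta
        for _ in range(substeps):
            dW = np.sqrt(dt) * rng.standard_normal((runs, 2))
            Z = Z - dt * (Z @ A.T) + np.sqrt(eta) * sigma * dW

    sgda_msq = float(np.mean(np.sum(z**2, axis=1)))
    sde_msq = float(np.mean(np.sum(Z**2, axis=1)))
    return sgda_msq, sde_msq

if __name__ == "__main__":
    sgda_msq, sde_msq = simulate()
    print(f"SGDA E||z||^2: {sgda_msq:.4f}   SDE E||Z||^2: {sde_msq:.4f}")
```

The comparison is between expectations of a test function (here the squared norm), not between individual paths, which is the sense in which the approximation is "weak". Shrinking `eta` should shrink the gap between the two estimates, up to Monte Carlo error.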

Figures (9)

  • Figure 1: Empirical validation of Theorems \ref{thm:SEG_SDE_Insights} and \ref{thm:SHGD_SDE_Insights}: The trajectories of the simulated SDEs match those of the respective algorithms averaged over $5$ runs. That of SGDA gets trapped in limit cycles as well (Top Left); That of SHGD converges to the optimum of a highly nonlinear landscape (Bottom Left); The SDE of SGDA would not be a good model for SEG (Top Right); The SDEs and the optimizers move along the trajectory at the same speed (Bottom Right). For a description of the landscapes and of the simulation settings for the SDEs, see Appendix \ref{app:Experiments}.
  • Figure 2: Graphical representation of the implicit regularization of the vector field of SEG for $f(x,y) = x y$: $-F$ spins the dynamics in a circle (Top Left); $+ \nabla F F$ pulls it towards $0$ (Top Right); If $\rho$ is small, $-F + \rho \nabla F F$ combines the two fields and spirals towards the origin (Bottom Left); If $\rho$ is large, $-F + \rho \nabla F F$ is a chaotic field that makes the dynamics diverge (Bottom Right).
  • Figure 3: Empirical validation of Prop. \ref{prop:SHGD_Convergence_PIBG_NoG_Insights_NoSched} and Prop. \ref{prop:SHGD_Convergence_PIBG_NoG_Insights_Sched} (Left); Prop. \ref{prop:SEG_Convergence_PIBG_NoG_Insights_NoSched} and Prop. \ref{prop:SEG_Convergence_PIBG_NoG_Insights_Sched} (Right): The dynamics of ${\mathbb E} \left[\lVert Z_t \rVert^2 \right]$ averaged across $5$ runs perfectly matches that prescribed by our results for all schedulers. Both for SEG and SHGD, $\eta=0.01$, while $\rho=1$.
  • Figure 4: Comparison between SEG and SHGD on Quadratic Games: (Left) $\rho^{V}$ and $\rho^{H}$ meet the designated goals; a negative $\rho$ is sometimes desirable, as positive values slow down convergence. A large $|\rho|$ induces faster convergence, which in turn results in larger suboptimality. (Right) A negative $\rho$ escapes the bad saddle faster than SGDA, positive values induce convergence, and $\rho^{H}$ matches the decay of SHGD. In both experiments, $\eta=0.01$.
  • Figure 5: Empirical validation of Prop. \ref{prop:SHGD_Convergence_PIBG_NoG_Insights_NoSched} and Prop. \ref{prop:SHGD_Convergence_PIBG_NoG_Insights_Sched} (Left); Prop. \ref{prop:SEG_Convergence_PIBG_NoG_Insights_NoSched} and Prop. \ref{prop:SEG_Convergence_PIBG_NoG_Insights_Sched} (Right): The dynamics of ${\mathbb E} \left[\lVert Z_t \rVert^2 \right]$ averaged across $5$ runs perfectly matches that prescribed by our results for all schedulers. Both for SEG and SHGD, $\eta=0.01$, while $\rho=2$.
  • ...and 4 more figures
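
The small-$\rho$ panels of Figure 2 can be checked numerically. The sketch below, with hypothetical helper names, evaluates the field $-F + \rho \nabla F F$ for $f(x,y) = xy$, taking $F(z) = (\partial_x f, -\partial_y f) = (y, -x)$ as an assumed sign convention. It confirms that $-F$ is a pure rotation (no radial component) and that $\nabla F F = -z$ points toward the origin. It does not address the large-$\rho$ panel.

```python
import numpy as np

# Constant Jacobian of F(z) = (y, -x) for f(x, y) = x * y
JAC_F = np.array([[0.0, 1.0], [-1.0, 0.0]])

def F(z):
    """Vector field (df/dx, -df/dy) of the bilinear game f(x, y) = x * y."""
    x, y = z
    return np.array([y, -x])

def seg_drift(z, rho):
    """Regularized field -F(z) + rho * (grad F)(z) F(z) from the caption."""
    return -F(z) + rho * JAC_F @ F(z)

def radial_rate(z, rho):
    """<z, drift>, i.e. d/dt of 0.5 * ||z||^2 along the drift."""
    return float(np.dot(z, seg_drift(z, rho)))

if __name__ == "__main__":
    z = np.array([1.0, 0.0])
    print("-F(z)        =", -F(z))          # tangential: rotation only
    print("grad F F(z)  =", JAC_F @ F(z))   # equals -z: pull toward 0
    print("radial rate, rho=0  :", radial_rate(z, 0.0))
    print("radial rate, rho=0.1:", radial_rate(z, 0.1))
```

For this game the radial rate works out to $-\rho \lVert z \rVert^2$, so any small positive $\rho$ turns the closed orbits of $-F$ into an inward spiral.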

Theorems & Definitions (95)

  • Definition 3.2: Weak Approximation
  • Theorem 3.3: SGDA SDE - Informal Statement of Theorem \ref{thm:SGDA_SDE}
  • Theorem 3.4: Informal Statement of Theorem \ref{thm:SEG_SDE}
  • proof
  • Corollary 3.5: Informal Statement of Corollary \ref{thm:SEG_SDE_Simplified_SameSample}
  • proof
  • Theorem 3.6: SHGD SDE - Informal Statement of Theorem \ref{thm:SHGD_SDE}
  • Corollary 3.7: Informal Statement of Corollary \ref{thm:SHGD_SDE_Simplified_SameSample}
  • Theorem 4.1: SHGD General Convergence
  • proof
  • ...and 85 more