Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects

Ke Liang Xiao, Noah Marshall, Atish Agarwala, Elliot Paquette

TL;DR

This work analyzes signSGD in a high-dimensional limit, deriving a limiting SDE (signHSGD) and a deterministic ODE for the risk that together quantify how preconditioning and noise shape learning. It isolates four effects (effective learning rate, noise compression, diagonal preconditioning, and gradient-noise reshaping) and characterizes how each depends on the data and noise distributions, which clarifies when signSGD can outperform vanilla SGD. The paper closes with a conjecture on how these results might be extended to Adam.

Abstract

In recent years, signSGD has garnered interest both as a practical optimizer and as a simple model for understanding adaptive optimizers like Adam. Though there is a general consensus that signSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of signSGD in a high-dimensional limit, and derive a limiting SDE and ODE to describe the risk. Using this framework we quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but moves beyond that by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.
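To make the object of study concrete, here is a minimal sketch of one-sample signSGD on a noisy linear regression problem. It is an illustration only, not the paper's experimental code: the function name `sign_sgd_risk`, the isotropic Gaussian data, the `lr / d` step-size scaling, and every default value are assumptions chosen for this example.

```python
import numpy as np

def sign_sgd_risk(d=100, steps=5000, lr=0.1, noise="gaussian", seed=0):
    """Run one-sample signSGD on y = <x, theta_star> + eps and track the excess risk.

    Assumptions (not from the source): isotropic Gaussian inputs, a step of
    size lr / d per coordinate, and standard Gaussian or standard Cauchy noise.
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d) / np.sqrt(d)  # hypothetical ground truth
    theta = np.zeros(d)
    risks = np.empty(steps)
    for k in range(steps):
        # For isotropic data the excess risk is half the squared distance to theta_star.
        risks[k] = 0.5 * np.sum((theta - theta_star) ** 2)
        x = rng.standard_normal(d)
        eps = rng.standard_cauchy() if noise == "cauchy" else rng.standard_normal()
        residual = x @ (theta - theta_star) - eps
        grad = residual * x                       # stochastic gradient of the squared loss
        theta -= (lr / d) * np.sign(grad)         # signSGD: keep only the sign per coordinate
    return risks

if __name__ == "__main__":
    for noise in ("gaussian", "cauchy"):
        r = sign_sgd_risk(noise=noise)
        print(noise, "initial risk:", r[0], "mean of last 500 risks:", r[-500:].mean())
```

Because each coordinate moves by exactly `lr / d` per step regardless of the gradient magnitude, the iterates stay bounded even when the label noise has no finite variance, which is the setting where the sketch's Cauchy option is meant to be tried.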

Paper Structure

This paper contains 37 sections, 28 theorems, 260 equations, 9 figures, 1 table.

Key Result

Theorem 1

Given Assumptions ass:linear_plus_noise_targets–ass:concentration_of_init and choosing any fixed $p>0$, there exists a constant $C(\overline{\mathbf{K}},\epsilon)>0$ such that for any $\delta \in (1/3,1/2)$ and all $T>3$, with probability at least $1-c(p,\overline{\mathbf{K}})d^{p(1/3 -\delta)}$ for a constant $c(p,\overline{\mathbf{K}})$ independent of $d$.

Figures (9)

  • Figure 1: Dynamics of the risk under signSGD and signHSGD on synthetic and real datasets. signHSGD and its deterministic equivalent ODE are good models for the risk dynamics even for $d = 500$ (a, b) or on real datasets (c, d). The convergence of signSGD for Cauchy noise (b) is remarkable given that SGD fails to converge there. The usefulness of the ODE on CIFAR10 and IMDB movie reviews is remarkable given the non-Gaussian nature of the data and the substantial estimation required for key quantities like ${\it_*}$ or $\epsilon$. For the CIFAR10 dataset, we validate the results of Theorem \ref{thm:odeconvergence}, which gives the limit risk of signODE under Gaussian data. We include the deterministic equivalent for SGD (vanillaODE) for reference. Details of these experiments may be found in Appendix \ref{app:exp_details}.
  • Figure 2: Examples of $\varphi$ for simple noise distributions. $\sqrt{\operatorname{Levy}}$ has Cauchy type-tails and vanishing density near $0$. We note that $\varphi(x)$ is trivially bounded above by $\tfrac{2}{\pi}$ and converges to $\tfrac{2}{\pi}$ as $x\to \infty$; the rate of convergence at $\infty$ is related to the tail decay rate. At $0$, $\varphi(x)/\sqrt{x}$ converges to the density of the noise at $0$ scaled by $2/\pi$.
  • Figure 3: Top: $\psi$ for Student's-$t$. Here $\psi$ is always greater than $1$ and $\epsilon$-compression accelerates signSGD. For sufficiently small $df$, $\psi > \pi/2$ over some range of $\mathcal{R}$ and signSGD also converges faster than SGD in the isotropic setting. Bottom: $\psi$ for $N(0,\dutchcal{v}^2)$, Rademacher, $\operatorname{Unif}(-1,1)$.
  • Figure 4: Log eigenvalues of $\mathbf{K}, \overline{\mathbf{K}}, \mathbf{K}_{\sigma}$ computed for the CIFAR10 dataset.
  • Figure 5: A demonstration that signSGD, signHSGD, and their deterministic equivalent concentrate in high dimensions over long time scales. Our main theorem shows that in the limit $d \to \infty$ all these objects coincide.
  • ...and 4 more figures

Theorems & Definitions (53)

  • Definition 1
  • Definition 2: signHSGD
  • Remark 1
  • Theorem 1: Main Theorem, part 1
  • Theorem 2: Main Theorem, part 2
  • Theorem 3
  • Theorem 4
  • Definition 3
  • Definition 4
  • Lemma 1
  • ...and 43 more