Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?

Jihwan Kim, Dogyoon Song, Chulhee Yun

TL;DR

The analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant, and the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope.

Abstract

We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope, when feature decay is fast but target decay is slow.
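The setup described in the abstract can be illustrated with a minimal simulation sketch. It is not the authors' code: the sketch normalization (entries with variance 1/d), the half-squared-error loss, the batch size of one, the particular warmup-stable-decay shape, and all function names and default values below are assumptions chosen for illustration.

```python
import numpy as np


def make_plrf(v, d, alpha, beta, rng):
    """Power-law features and targets plus a Gaussian sketch (assumed normalization)."""
    j = np.arange(1, v + 1, dtype=float)
    feat_std = j ** (-alpha)                  # x_j ~ N(0, j^(-2*alpha)): feature decay
    b = j ** (-beta)                          # target coefficients b_j = j^(-beta)
    W = rng.normal(size=(v, d)) / np.sqrt(d)  # sketch entries ~ N(0, 1/d)
    return feat_std, b, W


def population_risk(theta, feat_std, b, W):
    """E[(<W^T x, theta> - <x, b>)^2], in closed form for Gaussian x."""
    err = feat_std * (W @ theta - b)
    return float(err @ err)


def wsd_lr(t, steps, peak, warmup_frac=0.05, decay_frac=0.2):
    """One possible warmup-stable-decay shape: linear warmup, plateau, linear decay."""
    warmup = max(1, int(warmup_frac * steps))
    decay_start = steps - max(1, int(decay_frac * steps))
    if t < warmup:
        return peak * (t + 1) / warmup
    if t < decay_start:
        return peak
    return peak * (steps - t) / (steps - decay_start)


def train(method, steps, lr_fn, feat_std, b, W, rng):
    """One-pass training: every step draws a fresh sample (batch size 1)."""
    theta = np.zeros(W.shape[1])
    for t in range(steps):
        x = feat_std * rng.normal(size=feat_std.shape[0])
        z = W.T @ x                           # sketched features seen by the model
        grad = (z @ theta - x @ b) * z        # gradient of half the squared error
        update = np.sign(grad) if method == "signsgd" else grad
        theta -= lr_fn(t) * update
    return theta


def compare(v=200, d=20, alpha=1.0, beta=0.5, steps=2000, seed=0):
    """Return the population risk at initialization and after each run."""
    rng = np.random.default_rng(seed)
    feat_std, b, W = make_plrf(v, d, alpha, beta, rng)
    risks = {"init": population_risk(np.zeros(d), feat_std, b, W)}
    runs = {
        "sgd_const": ("sgd", lambda t: 0.05),
        "signsgd_const": ("signsgd", lambda t: 0.01),
        "signsgd_wsd": ("signsgd", lambda t: wsd_lr(t, steps, 0.01)),
    }
    for name, (method, lr_fn) in runs.items():
        theta = train(method, steps, lr_fn, feat_std, b, W, rng)
        risks[name] = population_risk(theta, feat_std, b, W)
    return risks
```

Calling `compare()` returns a dictionary of population risks for a small toy instance. The learning rates here are untuned placeholders, so the numbers should not be read as a reproduction of the paper's comparison, which is made under the optimal choice of learning rate.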

Paper Structure (133 sections, 5 theorems, 469 equations, 25 figures, 3 tables)

Key Result

Proposition K.1

$\mathcal{F}_0(N)$ is independent of $N$ and obeys

Figures (25)

  • Figure 1: Left: SGD vs. signSGD; Right: signSGD with constant vs. warmup-stable-decay schedules. Colored lines represent the training trajectories of each algorithm, and black lines denote the compute-optimal curves. The upper-right legend in each panel shows the theoretical value of the compute-optimal slope. For some parameter configurations, signSGD achieves a steeper compute-optimal slope than SGD (left panel), and warmup-stable-decay scheduling sharpens the compute-optimal slope relative to a constant schedule (right panel). See Appendix \ref{expinapp} for the parameters used in the experiment.
  • Figure 2: Left: Phase plane for signSGD; Right: Phase plane for SGD. The white region indicates parameter values with no power-law scaling. The dark blue area represents the region where warmup-stable-decay scheduling (Section \ref{sdsch}) yields a better compute-optimal exponent.
  • Figure 3: Decay of ${\bm{\theta}}^*$ in the basis of columns of ${\bm{U}}$ compared to ${\bm{w}}^*$. The legend on the top shows $(\alpha, \beta, \text{fitted slope of} \ {\bm{U}}^{\mathsf{T}} {\bm{\theta}}^*)$.
  • Figure 4: Phase planes comparing signSGD and SGD. The left panel is the signSGD phase plane and the right panel is the SGD phase plane. The mint green area, which covers all of Phases Bb and III and parts of Phases Ac, Ad, Ba, and IV, is where signSGD has a steeper compute-optimal slope than SGD. The same mint green area is drawn on both panels for clarity. We call this area Area $\text{III-IV}_{\text{sub}}$.
  • Figure 5: Phase plane comparing signSGD and DANA-decaying (Ferbach et al., 2025). The lime green area, which covers parts of Phases Ac, Ad, Ba, and Bb, is where signSGD has a steeper compute-optimal slope than DANA-decaying.
  • ...and 20 more figures
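
The compute-optimal curves and slopes mentioned in the captions above can be illustrated with a small sketch: for each compute budget, take the lowest loss over model sizes, then fit a line in log-log coordinates. The budget convention (compute equals model size times training steps), the toy loss shape, and the helper names are assumptions made for illustration, not the paper's risk formula.

```python
import numpy as np


def compute_optimal_slope(loss_fn, model_sizes, flops_grid):
    """Fit the log-log slope of the lower envelope of the loss over model sizes."""
    envelope = []
    for f in flops_grid:
        steps = f / model_sizes               # assumed budget: flops = model size * steps
        envelope.append(np.min(loss_fn(model_sizes, steps)))
    slope, _ = np.polyfit(np.log(flops_grid), np.log(envelope), 1)
    return slope


def toy_loss(d, t, a=1.0, b=1.0):
    """Hypothetical risk curve: an optimization term plus a capacity term."""
    return t ** (-a) + d ** (-b)


sizes = np.logspace(0, 4, 200)                # candidate model sizes
flops = np.logspace(2, 6, 30)                 # compute budgets
slope = compute_optimal_slope(toy_loss, sizes, flops)
```

For this toy loss, balancing the two terms under the budget constraint gives an exponent of -ab/(a+b), so with a = b = 1 the fitted slope should come out close to -0.5. A steeper (more negative) slope means the loss falls faster as compute grows, which is the sense in which the captions compare algorithms and schedules.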

Theorems & Definitions (15)

  • Remark 1: Dominant vs. balancing terms
  • Remark 2: Early-iteration proxy
  • Remark 3
  • Remark 4: Justification on drift term conversion
  • Proposition K.1
  • Proof (sketch)
  • Proposition K.2
  • Proof (sketch)
  • Proposition K.3
  • Proof (sketch)
  • ...and 5 more