Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise

Yuchen Fang; James Demmel; Javad Lavaei

TL;DR

We show that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when they are unknown, matching the respective optimal rates for normalized SGD.
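To make the two rates concrete, the exponents can be evaluated for a few tail indices $p \in (1,2]$. The snippet below is a small illustrative calculation and not part of the paper; the function name `rate_exponents` is a hypothetical helper. It only evaluates the two expressions $\frac{p-1}{3p-2}$ and $\frac{p-1}{2p}$ quoted above, using exact rational arithmetic.

```python
from fractions import Fraction


def rate_exponents(p):
    """Return (known, unknown) exponents a such that the rate is O(T^{-a}).

    known   = (p - 1) / (3p - 2)   # problem parameters known
    unknown = (p - 1) / (2p)       # problem parameters unknown
    `p` is the moment order of the noise and must lie in (1, 2].
    """
    p = Fraction(p)
    if not (1 < p <= 2):
        raise ValueError("p must satisfy 1 < p <= 2")
    known = (p - 1) / (3 * p - 2)
    unknown = (p - 1) / (2 * p)
    return known, unknown


if __name__ == "__main__":
    for p in (Fraction(5, 4), Fraction(3, 2), Fraction(2)):
        known, unknown = rate_exponents(p)
        print(f"p = {p}: known-parameter exponent {known}, "
              f"unknown-parameter exponent {unknown}")
```

Since $3p-2 \leq 2p$ exactly when $p \leq 2$, the known-parameter exponent is never smaller than the unknown-parameter one, and the two coincide at $p = 2$, where both equal $1/4$.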

Abstract

We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) and its accelerated variants under heavy-tailed noise, a setting that encompasses widely used adaptive methods such as Adam, RMSProp, and Shampoo. We assume the stochastic gradient noise has a finite $p$-th moment for some $p \in (1,2]$, and measure convergence after $T$ iterations. While clipping and normalization are parallel tools for stabilizing SGD training under heavy-tailed noise, there is a fundamental separation in their worst-case properties in stochastically preconditioned settings. We demonstrate that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known, and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when problem parameters are unknown, matching the respective optimal rates for normalized SGD. In contrast, we prove that clipping may fail to converge in the worst case due to the statistical dependence between the stochastic preconditioner and the gradient estimates. To enable the analysis, we develop a novel vector-valued Burkholder-type inequality that may be of independent interest. These results provide a theoretical explanation for the empirical preference for normalization over clipping in large-scale model training.
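As a rough illustration of how normalization and clipping differ once a preconditioner is involved, the sketch below compares three update rules on a single step. Everything in it is an assumption made for illustration: the diagonal preconditioner, the function names, and the exact form of each update are hypothetical and are not taken from the paper's algorithm. It only shows how step length behaves under each ordering of operations; it does not reproduce the paper's worst-case construction, which relies on statistical dependence between the preconditioner and the gradient estimate.

```python
import math


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def precondition(diag, g):
    """Apply a diagonal preconditioner (assumed form) to gradient g."""
    return [d * x for d, x in zip(diag, g)]


def clip(v, tau):
    """Rescale v so that its norm is at most tau."""
    n = norm(v)
    scale = 1.0 if n <= tau else tau / n
    return [scale * x for x in v]


def normalized_step(diag, g, eta):
    """Precondition, then normalize: the step length is eta (or 0)."""
    d = precondition(diag, g)
    n = norm(d)
    if n == 0.0:
        return [0.0] * len(g)
    return [eta * x / n for x in d]


def clip_then_precondition_step(diag, g, eta, tau):
    """Clip the raw gradient first; the preconditioner can re-inflate it."""
    return [eta * x for x in precondition(diag, clip(g, tau))]


def precondition_then_clip_step(diag, g, eta, tau):
    """Clip the preconditioned direction; step length is at most eta * tau."""
    return [eta * x for x in clip(precondition(diag, g), tau)]


if __name__ == "__main__":
    g = [1000.0, 0.0]      # one heavy-tailed gradient sample (made up)
    diag = [50.0, 1.0]     # large preconditioner entry on the noisy coordinate
    eta, tau = 0.1, 1.0
    print(norm(normalized_step(diag, g, eta)))
    print(norm(clip_then_precondition_step(diag, g, eta, tau)))
    print(norm(precondition_then_clip_step(diag, g, eta, tau)))
```

In this toy setting the normalized step has length `eta` regardless of the preconditioner, whereas in the clip-then-precondition ordering the step length scales with the preconditioner entry applied after clipping.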

Paper Structure

This paper contains 28 sections, 20 theorems, 144 equations, 1 table, and 1 algorithm.

Introduction
Related work
Stochastically Preconditioned Methods
Takeaway: A Geometric Perspective
Normalization Ensures Convergence
When all algorithmic parameters are known
When all algorithmic parameters are unknown
Step normalization is robust
Vector-valued Burkholder-type inequality
What Might Go Wrong for Clipping?
Clipping-Then-Preconditioning
Preconditioning-Then-Clipping
Discussion
Applications to large-scale machine learning
Clipping is still important
...and 13 more sections

Key Result

Theorem 4.5

Under Assumptions assumption1, assumption2, assumption3, and assumption4, let $\Delta \coloneqq f(\boldsymbol{x}_1) - f_*$. For any $T \geq 1$, we select $\eta= \sqrt{ \frac{(1 - \theta) \Delta}{L T}}$, $\theta = 1 - \min \left\{ 1, \max \left\{\left( \frac{\Delta L }{\sigma^2 T}\right)^{\frac{p}{3p

Theorems & Definitions (37)

Theorem 4.5
Theorem 4.6
Remark 4.7
Remark 4.8
Lemma 4.9
Lemma 4.10: Vector-valued Burkholder-type inequality
Lemma 5.2
Lemma 5.3
Remark 5.4
Example 5.5
...and 27 more
