Revisiting Gradient Normalization and Clipping for Nonconvex SGD under Heavy-Tailed Noise: Necessity, Sufficiency, and Acceleration
Tao Sun, Xinwang Liu, Kun Yuan
TL;DR
The paper studies nonconvex SGD under heavy-tailed gradient noise and questions whether gradient clipping is necessary. It shows that gradient normalization alone suffices for convergence under individual smoothness assumptions, and that combining normalization with clipping yields better convergence rates, removing logarithmic factors and improving constants. The authors extend these results to nonconvex variance-reduced algorithms and introduce an accelerated variant that exploits second-order smoothness to achieve faster rates, all without requiring large mini-batch sizes. The work provides a unified theory covering normalization-only, clipping-only, and combined approaches, and offers practical guidance for using normalization and clipping in noisy nonconvex optimization. It also discusses extensions to general normalization operators and notes that the relaxed mini-batch requirements may help generalization.
Abstract
Gradient clipping has long been considered essential for ensuring the convergence of Stochastic Gradient Descent (SGD) in the presence of heavy-tailed gradient noise. In this paper, we revisit this belief and explore whether gradient normalization can serve as an effective alternative or complement. We prove that, under individual smoothness assumptions, gradient normalization alone is sufficient to guarantee convergence of nonconvex SGD. Moreover, when combined with clipping, it yields far better rates of convergence under more challenging noise distributions. We provide a unifying theory describing normalization-only, clipping-only, and combined approaches. We then examine existing variance-reduced algorithms and establish that, in this setting as well, normalization alone is sufficient for convergence. Finally, we present an accelerated variant that achieves faster convergence under second-order smoothness. Our results provide theoretical insights and practical guidance for using normalization and clipping in nonconvex optimization with heavy-tailed noise.
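
To make the two operations concrete, the following is a minimal sketch of the textbook definitions of norm clipping and gradient normalization, plus one plausible way of combining them in a momentum step. The function names, the threshold `tau`, the momentum weight `beta`, and the clip-then-average-then-normalize ordering are all assumptions chosen for illustration; the abstract does not specify the paper's exact update rule, so this should not be read as the authors' algorithm.

```python
import math


def l2_norm(g):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in g))


def clip(g, tau):
    """Norm clipping: rescale g so that its norm is at most tau."""
    n = l2_norm(g)
    scale = min(1.0, tau / n) if n > 0 else 1.0
    return [scale * v for v in g]


def normalize(g, eps=1e-12):
    """Normalization: keep the direction of g and discard its magnitude."""
    n = l2_norm(g)
    return [v / (n + eps) for v in g]


def momentum_step(x, m, g, lr, beta, tau=None):
    """One normalized momentum step (hypothetical combination, see lead-in).

    If tau is given, the stochastic gradient g is clipped before it enters
    the momentum buffer m; the buffer is then normalized, so the iterate
    moves by (almost exactly) lr regardless of how large g was.
    """
    if tau is not None:
        g = clip(g, tau)
    m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
    d = normalize(m)
    x = [xi - lr * di for xi, di in zip(x, d)]
    return x, m


# Usage: a single heavy-tailed outlier gradient cannot blow up the step.
x, m = [0.0, 0.0], [0.0, 0.0]
x, m = momentum_step(x, m, [3e6, 4e6], lr=0.1, beta=0.0, tau=1.0)
```

In this sketch, normalization bounds the length of every step by `lr`, while clipping limits how much a single outlier sample can contaminate the momentum buffer. Applied to a single gradient with no averaging, clipping followed by normalization would give the same direction as normalization alone, which is why the combination is shown with a momentum buffer.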
