Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under $(L_0, L_1)$-Smoothness

Nikita Kornilov, Philip Zmushko, Andrei Semenov, Mark Ikonnikov, Alexander Gasnikov, Alexander Beznosikov

TL;DR

The paper addresses non-convex stochastic optimization under generalized $(L_0,L_1)$-smoothness in the presence of heavy-tailed gradient noise and shows that simple sign-based methods (SignSGD with batching, Majority Voting, and momentum variants) achieve new high-probability convergence bounds. It derives explicit sample complexities that scale with problem constants and the noise moment $\kappa$, revealing a two-phase convergence behavior and robustness to the noise distribution. The theoretical contributions encompass HP bounds for standard and generalized smoothness, PL-function restarting analysis, and momentum-based variants, all under mild assumptions. Empirically, sign-based methods demonstrate competitive and often superior performance for large-scale language model training compared with clipping, normalization, and even AdamW, highlighting practical impact for robust, communication-efficient stochastic optimization in real-world deep learning tasks.
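For orientation, the two assumptions named above are commonly stated in the literature as follows. This is a sketch of the usual forms only; the paper's exact statements (for example, a first-order smoothness condition or a coordinate-wise noise bound) may differ. The symbol $g(x)$ for a stochastic gradient estimate is introduced here purely for illustration.

$$\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{(generalized smoothness)}, \qquad \mathbb{E}\left[\|g(x) - \nabla f(x)\|^{\kappa}\right] \le \sigma^{\kappa} \quad \text{(bounded $\kappa$-th moment)}.$$

Setting $L_1 = 0$ recovers standard $L_0$-smoothness, and $\kappa = 2$ recovers the bounded-variance setting.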

Abstract

In recent years, non-convex optimization problems have increasingly been described by the generalized $(L_0, L_1)$-smoothness assumption rather than the standard one. Meanwhile, the severely corrupted data used in these problems has increased the demand for methods capable of handling heavy-tailed noise, i.e., noise with a bounded $\kappa$-th moment. Motivated by these real-world trends and challenges, we explore sign-based methods in this setup and demonstrate their effectiveness in comparison with other popular solutions such as clipping or normalization. In theory, we prove the first-known high probability convergence bounds under $(L_0, L_1)$-smoothness and heavy-tailed noise with mild parameter dependencies. In the case of standard smoothness, these bounds are novel for sign-based methods as well. In particular, SignSGD with batching achieves sample complexity $\tilde{O}\left(\left(\frac{\Delta L_0 d}{\varepsilon^2} + \frac{\Delta L_1 d^{\frac{3}{2}}}{\varepsilon}\right)\left[1 + \left(\frac{\sigma}{\varepsilon}\right)^{\frac{\kappa}{\kappa-1}}\right]\right)$ for $\kappa \in (1,2]$. Under the assumption of symmetric noise, SignSGD with Majority Voting works robustly over the whole range $\kappa \in (0,2]$ with complexity $\tilde{O}\left(\left(\frac{\Delta L_0 d}{\varepsilon^2} + \frac{\Delta L_1 d^{\frac{3}{2}}}{\varepsilon}\right)\left[\frac{1}{\kappa^2} + \frac{\sigma^2}{\varepsilon^2}\right]\right)$. We also obtain results for parameter-agnostic setups, Polyak-Lojasiewicz functions, and momentum-based methods (in expectation). Our theoretical findings are supported by the superior performance of sign-based methods in training Large Language Models compared to clipping and normalization.
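To make the two update rules mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual forms of the methods: minibatch SignSGD takes the sign of an averaged gradient, while Majority Voting takes the sign of the summed per-sample signs. The function names, the toy quadratic objective, the Student-t noise, and the step size are all hypothetical choices made for illustration.

```python
import numpy as np


def minibatch_sign_sgd_step(x, grad_oracle, step, batch_size, rng):
    """One step: x - step * sign(average of batch_size stochastic gradients)."""
    grads = np.stack([grad_oracle(x, rng) for _ in range(batch_size)])
    return x - step * np.sign(grads.mean(axis=0))


def majority_vote_sign_sgd_step(x, grad_oracle, step, num_voters, rng):
    """One step: x - step * sign(sum of the signs of num_voters stochastic gradients)."""
    signs = np.stack([np.sign(grad_oracle(x, rng)) for _ in range(num_voters)])
    return x - step * np.sign(signs.sum(axis=0))


def noisy_quadratic_grad(x, rng):
    """Hypothetical oracle: gradient of f(x) = 0.5 * ||x||^2 plus heavy-tailed noise.

    Student-t noise with 1.5 degrees of freedom is symmetric and has infinite
    variance, so it is a simple stand-in for heavy-tailed gradient noise.
    """
    return x + rng.standard_t(df=1.5, size=x.shape)


def run(step_fn, x0, num_steps, step, batch, seed=0):
    """Apply step_fn repeatedly from x0 and return the final iterate."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        x = step_fn(x, noisy_quadratic_grad, step, batch, rng)
    return x
```

An odd number of voters avoids ties in the vote; with an even number, a tied coordinate has sign zero and is simply left unchanged on that step.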

Paper Structure

This paper contains 48 sections, 21 theorems, 145 equations, 3 figures, 6 tables, 6 algorithms.

Key Result

Lemma 1

Consider a lower-bounded $(L_0, L_1)$-smooth function $f$ (As. as: bounded, as: smooth) and heavy-tailed (HT) gradient estimates $\Vec{\sigma}_k$ (As. as: pBCM). Then Alg. alg: signSGD after $T$ iterations with non-increasing stepsizes $\gamma_k \leq 1/\left(48 L_1 d^{\frac{3}{2}} \log\frac{1}{\delta}\right)$ achieves, with probability …, where $C_T := \max\limits_{k \in \overline{1,T}} \gamma_k \cdot \sum\limits_{\tau=1}^{k-1}\gamma_{\dots}$ …
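The stepsize condition in the lemma can be evaluated directly. The sketch below computes the cap $1/(48 L_1 d^{3/2} \log\frac{1}{\delta})$ for given constants. The function name is hypothetical, and it assumes that $\log$ denotes the natural logarithm and that $\delta \in (0,1)$ is the failure probability.

```python
import math


def max_stepsize(l1, dim, delta):
    """Largest stepsize allowed by gamma_k <= 1 / (48 * L1 * d^(3/2) * log(1/delta)).

    Assumes a natural logarithm and 0 < delta < 1. For l1 == 0 the condition
    imposes no restriction, so infinity is returned.
    """
    if l1 < 0 or dim <= 0 or not (0.0 < delta < 1.0):
        raise ValueError("need l1 >= 0, dim > 0 and 0 < delta < 1")
    if l1 == 0:
        return math.inf
    return 1.0 / (48.0 * l1 * dim ** 1.5 * math.log(1.0 / delta))
```

The cap shrinks as $d^{-3/2}$ in the dimension but only logarithmically in $1/\delta$, so a higher confidence level costs comparatively little in stepsize.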

Figures (3)

  • Figure 1: Experimental noise dependencies for $(L_0, L_1)$-smooth problems.
  • Figure 2: Experimental convergence speed transition for $(L_0, L_1)$-smooth problems.
  • Figure : Comparison of validation perplexity for various optimization methods across LLaMA model scales trained on C4

Theorems & Definitions (35)

  • Lemma 1: SignSGD Convergence Lemma
  • Theorem 1: HP complexity for minibatch-SignSGD
  • Theorem 2: HP complexity for MajorityVote-SignSGD
  • Theorem 3: Complexity for M-SignSGD in expectation
  • Example 1: Power of Norm
  • Example 2: Exponent of the Inner Product
  • Example 3: Logistic Function
  • Example 4: Quadratic Function with Linear Term
  • Lemma 2
  • Proposition 1: Norm Relation
  • ...and 25 more