Complexity of normalized stochastic first-order methods with momentum under heavy-tailed noise

Chuan He, Zhaosong Lu, Defeng Sun, Zhanwang Deng

TL;DR

The paper addresses unconstrained smooth optimization under heavy-tailed stochastic gradient noise by developing three practical normalized stochastic first-order methods (SFOMs) with momentum: Polyak, multi-extrapolated, and recursive momentum. Each method uses dynamically updated parameters and normalization, so it does not require knowledge of Lipschitz constants or noise bounds, and achieves first-order oracle complexity that matches or improves on the best-known results under heavy-tailed noise and weaker smoothness assumptions. The authors extend the analysis to higher-order smoothness, obtaining accelerated rates for the multi-extrapolated variant, and to a weakly average smoothness regime for the recursive variant. Numerical experiments on data fitting, robust regression, and multimodal contrastive learning support the practical effectiveness of the methods and illustrate the effects of parameter tuning and momentum. Overall, the work provides parameter-free or parameter-light SFOMs with theoretical guarantees and practical performance in the presence of heavy-tailed noise.

Abstract

In this paper, we propose practical normalized stochastic first-order methods with Polyak momentum, multi-extrapolated momentum, and recursive momentum for solving unconstrained optimization problems. These methods employ dynamically updated algorithmic parameters and do not require explicit knowledge of problem-dependent quantities such as the Lipschitz constant or noise bound. We establish first-order oracle complexity results for finding approximate stochastic stationary points under heavy-tailed noise and weakly average smoothness conditions -- both of which are weaker than the commonly used bounded variance and mean-squared smoothness assumptions. Our complexity bounds either improve upon or match the best-known results in the literature. Numerical experiments are presented to demonstrate the practical effectiveness of the proposed methods.
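To make the idea of a normalized method with Polyak momentum concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: the schedules `step_size` and `momentum_weight`, the toy objective $f(x)=\tfrac12\|x\|^2$, and the Student-t noise model are all hypothetical choices made for this example, since the paper's actual parameter sequences $\{(\eta_k,\theta_k)\}$ are not reproduced here.

```python
import numpy as np


def step_size(k):
    # Hypothetical decaying step size; NOT the schedule from the paper.
    return 1.0 / (k + 1) ** 0.75


def momentum_weight(k):
    # Hypothetical momentum weight in (0, 1]; NOT the schedule from the paper.
    return 1.0 / (k + 1) ** 0.5


def normalized_sgd_polyak(grad_oracle, x0, num_iters, rng):
    """Normalized stochastic gradient steps with a Polyak-style momentum average.

    grad_oracle(x, rng) returns a stochastic gradient estimate at x.
    """
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for k in range(num_iters):
        eta, theta = step_size(k), momentum_weight(k)
        g = grad_oracle(x, rng)
        # Momentum: exponential average of the stochastic gradients.
        m = (1.0 - theta) * m + theta * g
        norm = np.linalg.norm(m)
        if norm > 0.0:
            # Normalization: every step has length exactly eta, so a single
            # heavy-tailed gradient sample cannot produce a huge move.
            x = x - eta * m / norm
    return x


def exact_quadratic_grad(x, rng):
    # Gradient of the toy objective f(x) = 0.5 * ||x||^2 (noise-free).
    return x


def heavy_tailed_quadratic_grad(x, rng):
    # Same gradient plus Student-t noise with 1.5 degrees of freedom:
    # zero mean, infinite variance (assumed noise model for illustration).
    return x + rng.standard_t(df=1.5, size=x.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_final = normalized_sgd_polyak(
        heavy_tailed_quadratic_grad, 10.0 * np.ones(5), 2000, rng
    )
    print("final gradient norm:", np.linalg.norm(x_final))
```

Because each update has length exactly `step_size(k)`, the total displacement is bounded by the sum of the step sizes regardless of how large an individual noisy gradient is. This is one intuition for why normalization pairs naturally with heavy-tailed noise.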

Paper Structure

This paper contains 16 sections, 24 theorems, 144 equations, 9 figures, 1 table, 3 algorithms.

Key Result

Theorem 1

Suppose that Assumption asp:basic holds. Let $f_{\mathrm{low}}$, $L_1$, $\sigma$, and $\alpha$ be given in Assumption asp:basic, and define Let $\{x^k\}$ be generated by Algorithm alg:unf-sfom-pm with input parameters $\{(\eta_k,\theta_k)\}$ given by Then, for any $\epsilon \in (0,1)$, it holds that $\mathbb{E}[\|\nabla f(x^{\iota_K})\|]\le \epsilon$ for all $K$ satisfying where $\iota_K$ is un

Figures (9)

  • Figure 1: Convergence behavior of the relative objective value gap (first row) and relative gradient norm (second row) for all the methods when solving problem \ref{df}.
  • Figure 2: Distributions of gradient errors $\|G(x;\xi)-\nabla f(x)\|$ (first row) and Lipschitz constant estimates $\|G(y;\xi)-G(x;\xi)\|/\|y-x\|$ (second row) compared against a normal distribution (QQ-plot), when solving \ref{robust-reg}. Here, the gradient errors are calculated for the first epoch of optimization, and the Lipschitz constant estimates are taken over every two consecutive iterates within the first epoch of optimization for all methods.
  • Figure 3: Convergence behavior of the relative objective value gap (first row) and relative gradient norm (second row) for all the methods when solving problem \ref{robust-reg}.
  • Figure 4: Distributions of gradient errors $\|G(x;\xi)-\nabla f(x)\|$ (first row) and Lipschitz constant estimates $\|G(x;\xi)-G(y;\xi)\|/\|x-y\|$ (second row) compared against a normal distribution (QQ-plot), when solving \ref{multi-modal}. Here, the gradient errors are calculated for the first epoch of training, and the Lipschitz constant estimates are taken over every two consecutive iterates within the first epoch of training for all methods.
  • Figure 5: Convergence behavior of the relative objective value gap (first row) and relative gradient norm (second row) for all the methods when solving problem \ref{multi-modal}.
  • ...and 4 more figures

Theorems & Definitions (53)

  • Remark 1
  • Theorem 1: complexity with known $\alpha$
  • Theorem 2: complexity with unknown $\alpha$
  • Remark 2
  • Theorem 3: complexity with known $\alpha$
  • Theorem 4: complexity with unknown $\alpha$
  • Remark 3
  • Remark 4
  • Theorem 5: complexity with known $\alpha$
  • Theorem 6: complexity with unknown $\alpha$
  • ...and 43 more