Smoothed Gradient Clipping and Error Feedback for Decentralized Optimization under Symmetric Heavy-Tailed Noise

Shuhua Yu, Dusan Jakovetic, Soummya Kar

TL;DR

A smoothed clipping operator is developed, and a decentralized gradient method equipped with an error feedback mechanism is proposed that achieves a mean-square error (MSE) convergence rate of $O(1/t^\delta)$, where the exponent $\delta$ is independent of the existence of higher order gradient noise moments and lower bounded by some constant dependent on condition number.

Abstract

Motivated by understanding and analysis of large-scale machine learning under heavy-tailed gradient noise, we study decentralized optimization with gradient clipping, i.e., in which certain clipping operators are applied to the gradients or gradient estimates computed from local nodes prior to further processing. While vanilla gradient clipping has proven effective in mitigating the impact of heavy-tailed gradient noise in non-distributed setups, it incurs bias that causes convergence issues in heterogeneous distributed settings. To address the inherent bias introduced by gradient clipping, we develop a smoothed clipping operator, and propose a decentralized gradient method equipped with an error feedback mechanism, i.e., the clipping operator is applied on the difference between some local gradient estimator and local stochastic gradient. We consider strongly convex and smooth local functions under symmetric heavy-tailed gradient noise that may not have finite moments of order greater than one. We show that the proposed decentralized gradient clipping method achieves a mean-square error (MSE) convergence rate of $O(1/t^\delta)$, $\delta \in (0, 2/5)$, where the exponent $\delta$ is independent of the existence of higher order gradient noise moments $\alpha > 1$ and lower bounded by some constant dependent on condition number. To the best of our knowledge, this is the first MSE convergence result for decentralized gradient clipping under heavy-tailed noise without assuming bounded gradient. Numerical experiments validate our theoretical findings.
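
The abstract describes the mechanism only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' algorithm. It illustrates the two ideas named above: a smooth, bounded surrogate for clipping, and error feedback in which the clipping is applied to the difference between a local stochastic gradient and a local gradient estimator. The function names (`smoothed_clip`, `run_sketch`), the operator form $c\,y/\sqrt{y^2+\tau}$, the step-size schedules, the quadratic local losses, the ring mixing matrix, and the Student-t noise are all illustrative choices that do not come from the source.

```python
import numpy as np


def smoothed_clip(y, c=1.0, tau=1.0):
    """Component-wise smooth, odd, bounded surrogate for hard clipping.

    Assumed form (not taken from the source): c * y / sqrt(y^2 + tau).
    Each output component lies strictly inside (-c, c).
    """
    return c * y / np.sqrt(y * y + tau)


def run_sketch(num_nodes=4, dim=3, steps=2000, seed=0):
    """Toy decentralized loop with clipped error feedback (hypothetical setup)."""
    rng = np.random.default_rng(seed)

    # Heterogeneous strongly convex local losses f_i(x) = 0.5 * ||x - b_i||^2.
    # The minimizer of their sum is the mean of the b_i.
    b = rng.normal(size=(num_nodes, dim))

    # Doubly stochastic mixing matrix for a ring graph (illustrative choice).
    W = np.zeros((num_nodes, num_nodes))
    for i in range(num_nodes):
        W[i, i] = 0.5
        W[i, (i - 1) % num_nodes] += 0.25
        W[i, (i + 1) % num_nodes] += 0.25

    x = np.zeros((num_nodes, dim))  # local iterates, one row per node
    m = np.zeros((num_nodes, dim))  # local gradient estimators

    for t in range(steps):
        alpha = 1.0 / (t + 10)          # assumed decaying step size
        beta = 1.0 / (t + 10) ** 0.5    # assumed estimator weight

        # Symmetric heavy-tailed noise: Student-t with 1.5 degrees of freedom
        # has a finite mean but infinite variance.
        noise = rng.standard_t(df=1.5, size=(num_nodes, dim))
        g = (x - b) + noise             # local stochastic gradients

        # Error feedback: clip the difference between the stochastic gradient
        # and the current estimator, then move the estimator by that amount.
        m = m + beta * smoothed_clip(g - m)

        # Consensus averaging with neighbors followed by a descent step.
        x = W @ x - alpha * m

    return x, b.mean(axis=0)


if __name__ == "__main__":
    x_final, x_star = run_sketch()
    print("max distance to minimizer:", np.abs(x_final - x_star).max())
```

Because the clipped quantity is the estimator error rather than the raw gradient, each estimator update is bounded by `beta * c` per coordinate regardless of how large a single noise sample is, which is the intuition behind applying the operator to the difference.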

Paper Structure

This paper contains 16 sections, 16 theorems, 95 equations, 2 figures, 1 table, 2 algorithms.

Key Result

Proposition 1

The probability density function (pdf) in Example 1 is heavy-tailed and has a bounded first absolute moment, but has no finite moment of order greater than one.
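
To see how a density can have all three properties at once, consider the following generic illustration. It is offered only as an example of the phenomenon and is not claimed to be the pdf used in the paper; the symbols $p$, $u$, and $c$ are introduced here for illustration.

$$
p(u) = \frac{c}{(u^2 + 2)\,\ln^2(|u| + 2)}, \qquad u \in \mathbb{R},
$$

where $c > 0$ is the normalizing constant. The density is symmetric about zero. For large $|u|$ the integrand of the first absolute moment behaves like $1/(|u| \ln^2 |u|)$, whose antiderivative $-1/\ln |u|$ has a finite limit, so

$$
\int_{-\infty}^{\infty} |u|\, p(u)\, du < \infty .
$$

For any order $\alpha > 1$ the integrand behaves like $|u|^{\alpha - 2} / \ln^2 |u|$ with $\alpha - 2 > -1$, so

$$
\int_{-\infty}^{\infty} |u|^{\alpha}\, p(u)\, du = \infty .
$$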

Figures (2)

  • Figure 1: From left to right: average relative optimality gap $\log_{10} \big( (f(\boldsymbol{x}^t) - f(\boldsymbol{x}^*)) / (f(\boldsymbol{x}^0) - f(\boldsymbol{x}^*)) \big)$ over 10 runs in the network case, the same quantity in the server-client case, and the network graph.

Theorems & Definitions (31)

  • Remark 1
  • Example 1
  • Proposition 1
  • Theorem 1
  • Remark 2
  • Corollary 1
  • Remark 3
  • Lemma 1
  • proof
  • Lemma 2
  • ...and 21 more