Improved Quantization Strategies for Managing Heavy-tailed Gradients in Distributed Learning

Guangfeng Yan; Tan Li; Yuanzhang Xiao; Hanxu Hou; Linqi Song

TL;DR

This work tackles the communication bottleneck in distributed SGD by targeting heavy-tailed gradient distributions. It introduces a two-stage quantization framework that first truncates extreme gradient values at a threshold $\alpha$ and then quantizes the truncated values with a quantization density $\lambda_s(g)$. The quantization stage is unbiased, and the resulting convergence bound explicitly separates truncation bias from quantization variance. Assuming a power-law tail for the gradient distribution, the authors derive principled choices of $\alpha$ and $\lambda_s(g)$ for truncated uniform, nonuniform, and bi-scaled schemes (TUQSGD, TNQSGD, TBQSGD) and prove the corresponding convergence guarantees. Experiments on MNIST with 8 clients show that the truncation-based schemes outperform standard QSGD and NQSGD under a fixed communication budget, reaching higher test accuracy and a better communication-learning tradeoff. Overall, the method offers a theoretically grounded path to robust, communication-efficient distributed learning under heavy-tailed gradient statistics.

Abstract

Gradient compression has emerged as a key technique for improving communication efficiency in distributed learning. In distributed deep learning, however, gradient distributions are observed to be heavy-tailed, and the outliers strongly influence the design of compression strategies. Existing parameter quantization methods suffer performance degradation when this heavy-tailed feature is ignored. In this paper, we introduce a novel compression scheme designed specifically for heavy-tailed gradients, which combines gradient truncation with quantization. The scheme is implemented within a communication-limited distributed Stochastic Gradient Descent (SGD) framework. We consider a general family of heavy-tailed gradients that follow a power-law distribution and minimize the error resulting from quantization, thereby determining optimal values for two critical parameters: the truncation threshold and the quantization density. We provide a theoretical analysis of the convergence error bound under both uniform and non-uniform quantization. Comparative experiments against other benchmarks demonstrate the effectiveness of the proposed method in managing heavy-tailed gradients in a distributed learning environment.

Paper Structure (15 sections, 5 theorems, 46 equations, 5 figures, 1 algorithm)

Introduction
Problem Formulation
Truncated Quantizer for Heavy-Tail Gradients
Two-Stage Quantizer
Performance Analysis
Optimal Quantizer Parameter Design
Truncated Uniform Quantization
Truncated Nonuniform Quantization
Experiments
Conclusion
Appendix
Proof of Lemma 1
Proof of Lemma 2
Proof of Theorem 1
Truncated BiScaled Quantization

Key Result

Lemma 1

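For a truncated gradient element $g\in [a_1,a_2]$ with probability density function $p_g(\cdot)$, given the quantization points $\mathcal{L}=\{l_0, l_1,\ldots,l_s\}$, the nonuniform stochastic quantization $\mathcal{Q}_{\mathcal{L}}[g]$ satisfies unbiasedness, $\mathbb{E}\big[\mathcal{Q}_{\mathcal{L}}[g]\big] = g$, and bounded variance, $\mathbb{E}\big[(\mathcal{Q}_{\mathcal{L}}[g]-g)^2\big] \le \sum_{k=1}^{s} \frac{P_k |\Delta_k|^2}{4}$, where $P_k = \int_{l_{k-1}}^{l_k}p_g(x)\mathrm{d}x$ and $|\Delta_k| = l_k - l_{k-1}$.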
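
The two properties named in Lemma 1 can be checked numerically. The following is a minimal sketch, assuming the usual stochastic-rounding construction in which a value is rounded to one of its two neighbouring quantization points with probability proportional to proximity; the function names and the example points are illustrative and not taken from the paper. Under this construction the variance for a fixed $g$ in the $k$-th interval is $(g - l_{k-1})(l_k - g)$, which is at most $|\Delta_k|^2/4$.

```python
import bisect
import random


def interval_of(g, levels):
    """Return (l_{k-1}, l_k), the neighbouring quantization points around g.

    `levels` must be sorted and satisfy levels[0] <= g <= levels[-1].
    """
    k = min(max(bisect.bisect_right(levels, g), 1), len(levels) - 1)
    return levels[k - 1], levels[k]


def stochastic_quantize(g, levels, rng):
    """Round g up or down so that the expected output equals g (assumed rule)."""
    lo, hi = interval_of(g, levels)
    # Probability of rounding up grows linearly with the distance from lo.
    p_up = (g - lo) / (hi - lo)
    return hi if rng.random() < p_up else lo


def conditional_variance(g, levels):
    """Exact variance of the quantized value for a fixed g."""
    lo, hi = interval_of(g, levels)
    return (g - lo) * (hi - g)


# Hypothetical nonuniform points: denser near zero, sparser in the tails.
example_levels = [-1.0, -0.25, 0.0, 0.25, 1.0]
rng = random.Random(0)
samples = [stochastic_quantize(0.1, example_levels, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)  # should be close to 0.1
```

Placing points more densely where the gradient density is high makes the wide intervals carry small probability $P_k$, which is what keeps the weighted sum of $|\Delta_k|^2$ terms small.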

Figures (5)

Figure 1: The probability density of gradients computed with LeNet on MNIST. (The variance of the Laplace distribution is set equal to the gradient variance.)
Figure 2: Two-Stage Quantizer (with truncation range $[-\alpha, \alpha]$, $b=3$ quantization bits, and $s=7$ quantization levels).
Figure 3: Model performance of different algorithms.
Figure 4: Communication-learning tradeoff of different algorithms.
Figure 5: Truncated BiScaled Quantization.
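
The two-stage quantizer captioned in Figure 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses uniformly spaced points on $[-\alpha, \alpha]$ (the truncated uniform case), takes $s = 2^b - 1$ so that the $s+1$ points fit in $b$ bits (consistent with $b=3$, $s=7$ in the caption), and leaves the choice of $\alpha$ to the caller. The names `truncate` and `two_stage_quantize` are hypothetical.

```python
import random


def truncate(g, alpha):
    """Stage 1: clip g to the range [-alpha, alpha]."""
    return max(-alpha, min(alpha, g))


def two_stage_quantize(g, alpha, bits, rng):
    """Stage 2: stochastic uniform quantization of the truncated value.

    Uses s = 2**bits - 1 equal intervals, i.e. s + 1 points on [-alpha, alpha].
    """
    s = 2 ** bits - 1
    step = 2 * alpha / s
    t = truncate(g, alpha)
    pos = (t + alpha) / step          # position measured in steps, in [0, s]
    k = min(int(pos), s - 1)          # index of the lower neighbouring point
    p_up = pos - k                    # probability of rounding up
    idx = k + (1 if rng.random() < p_up else 0)
    return -alpha + idx * step


rng = random.Random(0)
# An outlier is clipped to the boundary before quantization.
clipped = two_stage_quantize(100.0, alpha=1.0, bits=3, rng=rng)
# A value inside the range is quantized without bias on average.
inside = [two_stage_quantize(0.3, alpha=1.0, bits=3, rng=rng) for _ in range(20000)]
inside_mean = sum(inside) / len(inside)
```

Truncation is the only source of bias in this sketch: values inside $[-\alpha, \alpha]$ are reproduced in expectation, while outliers are mapped to the boundary. A smaller $\alpha$ shrinks the step size and hence the quantization variance, at the cost of clipping more of the tail.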

Theorems & Definitions (6)

Lemma 1: Unbiasedness and Bounded Variance
Lemma 2
Definition 1: Power-law distribution (Clauset et al., 2009)
Theorem 1
Theorem 2
Theorem 3: Convergence Performance of TBQSGD
