Improved Quantization Strategies for Managing Heavy-tailed Gradients in Distributed Learning
Guangfeng Yan, Tan Li, Yuanzhang Xiao, Hanxu Hou, Linqi Song
TL;DR
This work tackles the communication bottleneck in distributed SGD by targeting heavy-tailed gradient distributions. It introduces a two-stage quantization framework that first truncates extreme gradient values with a threshold $\alpha$ and then applies quantization with a density $\lambda_s(g)$, providing unbiasedness guarantees and an explicit convergence bound that separates truncation bias from quantization variance. By assuming a power-law tail for gradient distributions, the authors derive principled methods to set $\alpha$ and design $\lambda_s(g)$, including truncated uniform, nonuniform, and bi-scaled schemes (TUQSGD, TNQSGD, TBQSGD), and prove corresponding convergence guarantees. Empirical results on MNIST with 8 clients show that truncation-based schemes outperform standard QSGD/NQSGD under fixed communication budgets, achieving higher test accuracy and a better trade-off between communication and learning performance. Overall, the method offers a theoretically grounded path to robust, communication-efficient distributed learning in the presence of heavy-tailed gradient statistics.
Abstract
Gradient compression has emerged as a key technique for improving communication efficiency in distributed learning. In distributed deep learning, however, gradient distributions are observed to be heavy-tailed, and the outliers significantly influence the design of compression strategies. Existing parameter quantization methods suffer performance degradation when this heavy-tailed feature is ignored. In this paper, we introduce a novel compression scheme designed for heavy-tailed gradients that combines gradient truncation with quantization. The scheme is implemented within a communication-limited distributed Stochastic Gradient Descent (SGD) framework. We consider a general family of heavy-tailed gradients that follow a power-law distribution and aim to minimize the error resulting from quantization, thereby determining optimal values for two critical parameters: the truncation threshold and the quantization density. We provide a theoretical analysis of the convergence error bound under both uniform and non-uniform quantization scenarios. Comparative experiments against other benchmarks demonstrate the effectiveness of our proposed method in managing heavy-tailed gradients in a distributed learning environment.
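The truncate-then-quantize idea can be illustrated with a minimal Python sketch of the uniform case. It is not the authors' implementation: the function names, the threshold `alpha`, and the number of intervals `s` are hypothetical and hand-picked here, whereas the paper derives the threshold and quantization density from the assumed power-law tail. The sketch clips each gradient entry to $[-\alpha, \alpha]$ and then rounds it stochastically onto an evenly spaced grid, so the quantizer is unbiased with respect to the truncated value (the truncation itself still introduces bias).

```python
import math
import random


def truncate(g, alpha):
    """Clip every gradient entry to the interval [-alpha, alpha]."""
    return [max(-alpha, min(alpha, x)) for x in g]


def stochastic_uniform_quantize(g, alpha, s, rng):
    """Stochastically round entries in [-alpha, alpha] onto s + 1 evenly spaced levels.

    Each entry moves to one of its two neighbouring levels, with the
    probability chosen so that the expected output equals the input.
    """
    step = 2.0 * alpha / s
    out = []
    for x in g:
        pos = (x + alpha) / step           # position on the grid, in [0, s]
        lower = math.floor(pos)
        prob_up = pos - lower              # distance to the lower level, in [0, 1)
        idx = lower + (1 if rng.random() < prob_up else 0)
        idx = min(max(idx, 0), s)          # guard against floating-point overshoot
        out.append(idx * step - alpha)
    return out


def truncated_uniform_quantize(g, alpha, s, rng):
    """Two-stage compression: truncate first, then quantize."""
    return stochastic_uniform_quantize(truncate(g, alpha), alpha, s, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    gradient = [-5.0, -0.3, 0.0, 0.2, 7.0]   # two outliers, as in a heavy tail
    print(truncated_uniform_quantize(gradient, alpha=1.0, s=4, rng=rng))
```

With `alpha=1.0` and `s=4`, every output lies on the grid $\{-1, -0.5, 0, 0.5, 1\}$, so each entry can be encoded as one of five level indices, and the two outliers are mapped to the endpoints instead of stretching the quantization range.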
