Unbiased and Sign Compression in Distributed Learning: Comparing Noise Resilience via SDEs

Enea Monzio Compagnoni, Rustem Islamov, Frank Norbert Proske, Aurelien Lucchi

TL;DR

The paper advances distributed optimization by deriving SDE-based models for DSGD, DCSGD, and DSignSGD to quantify how compression interacts with gradient noise. It shows that unbiased compression degrades convergence speed and quality in noisy settings, while sign-based compression remains robust to large, heavy-tailed noise and even enables linear speedups. The authors introduce practical scaling laws to adjust learning rates, batch sizes, and agent counts to preserve DSGD performance under compression and validate these insights across diverse architectures and tasks. These results offer principled guidelines for deploying compression-aware distributed optimizers in real-world systems and illuminate why sign-based methods may outperform unbiased schemes in noisy, large-scale settings.

Abstract

Distributed methods are essential for handling machine learning pipelines comprising large-scale models and datasets. However, their benefits often come at the cost of increased communication overhead between the central server and agents, which can become the main bottleneck, making training costly or even unfeasible in such systems. Compression methods such as quantization and sparsification can alleviate this issue. Still, their robustness to large and heavy-tailed gradient noise, a phenomenon sometimes observed in language modeling, remains poorly understood. This work addresses this gap by analyzing Distributed Compressed SGD (DCSGD) and Distributed SignSGD (DSignSGD) using stochastic differential equations (SDEs). Our results show that DCSGD with unbiased compression is more vulnerable to noise in stochastic gradients, while DSignSGD remains robust, even under large and heavy-tailed noise. Additionally, we propose new scaling rules for hyperparameter tuning to mitigate performance degradation due to compression. These findings are empirically validated across multiple deep learning architectures and datasets, providing practical recommendations for distributed optimization.
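To make the two compression families contrasted in the abstract concrete, the sketch below implements an unbiased sparsifier (Rand-$k$, the compressor named in the figure captions) and a sign compressor in plain Python. The function names `rand_k`, `sign_compress`, and `server_average` are hypothetical, and the server-side averaging is an assumption about how agents' messages are combined, not the authors' implementation.

```python
import random


def rand_k(grad, k, rng):
    """Rand-k sparsifier: keep k coordinates chosen uniformly at random
    and rescale them by d/k so the output equals grad in expectation."""
    d = len(grad)
    kept = set(rng.sample(range(d), k))
    return [g * d / k if i in kept else 0.0 for i, g in enumerate(grad)]


def sign_compress(grad):
    """Sign compressor: each coordinate becomes -1.0, 0.0, or 1.0,
    so the message magnitude no longer depends on the noise scale."""
    return [float((g > 0) - (g < 0)) for g in grad]


def server_average(messages):
    """Average the compressed messages coordinate-wise (assumed aggregation)."""
    n = len(messages)
    return [sum(column) / n for column in zip(*messages)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[1.0, -2.0, 3.0, 0.5], [0.5, -1.0, 2.0, 1.5]]  # two toy agents
    print(server_average([rand_k(g, 2, rng) for g in grads]))
    print(server_average([sign_compress(g) for g in grads]))
```

The $d/k$ rescaling is what makes Rand-$k$ unbiased, but it also amplifies whatever noise the kept coordinates carry. The sign compressor discards magnitudes entirely, which is one intuition for the robustness the abstract reports.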

Paper Structure

This paper contains 69 sections, 47 theorems, 110 equations, 13 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.2

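The SDE of DSGD is

$$dX_t = -\nabla f(X_t)\,dt + \sqrt{\frac{\eta}{N}}\,\sqrt{\hat{\Sigma}(X_t)}\,dW_t,$$

where $\hat{\Sigma}(x)\coloneqq \frac{1}{N} \sum_{i=1}^{N} \Sigma_i(x)$ is the average of the covariance matrices of the $N$ agents, $\eta$ is the learning rate, and $W_t$ is a Brownian motion.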
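As a rough illustration of what such an SDE model claims, the sketch below compares DSGD with an Euler-Maruyama simulation of an SDE on the one-dimensional quadratic $f(x) = x^2/2$. It assumes the SDE has drift $-\nabla f$ and diffusion $\sqrt{\eta/N}\,\sqrt{\hat{\Sigma}}$, and that every agent adds independent Gaussian gradient noise with standard deviation `sigma`, so $\hat{\Sigma} = \sigma^2$. The function names, the toy objective, and the identification of iteration $k$ with time $t = k\eta$ are all assumptions made for this example.

```python
import math
import random


def grad_f(x):
    """Gradient of the toy objective f(x) = x^2 / 2."""
    return x


def dsgd_second_moment(eta, n_agents, sigma, x0, steps, runs, seed=0):
    """Estimate E[x_K^2] for DSGD: each agent reports a noisy gradient,
    the server averages the reports and takes a step of size eta."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        x = x0
        for _ in range(steps):
            g = sum(grad_f(x) + rng.gauss(0.0, sigma)
                    for _ in range(n_agents)) / n_agents
            x -= eta * g
        total += x * x
    return total / runs


def sde_second_moment(eta, n_agents, sigma, x0, horizon, runs,
                      substeps=10, seed=1):
    """Estimate E[X_T^2] for dX = -grad_f(X) dt + sqrt(eta/N) * sigma dW
    using Euler-Maruyama with time step dt = eta / substeps."""
    rng = random.Random(seed)
    dt = eta / substeps
    n_steps = round(horizon / dt)
    diffusion = math.sqrt(eta / n_agents) * sigma
    total = 0.0
    for _ in range(runs):
        x = x0
        for _ in range(n_steps):
            x += -grad_f(x) * dt + diffusion * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        total += x * x
    return total / runs


if __name__ == "__main__":
    eta, n_agents, sigma, x0, steps = 0.1, 4, 1.0, 1.0, 100
    print("DSGD:", dsgd_second_moment(eta, n_agents, sigma, x0, steps, runs=1000))
    print("SDE: ", sde_second_moment(eta, n_agents, sigma, x0, steps * eta, runs=1000))
```

In this toy setting both estimates should settle near $\eta\sigma^2/(2N)$, which shrinks as the number of agents $N$ grows. The two values differ by a small discretization gap that vanishes as $\eta \to 0$.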

Figures (13)

  • Figure 1: Empirical validation that the trajectories of the SDEs match those of the respective algorithms, averaged over 500 runs: DSGD (Theorem \ref{thm:DSGD_Theorem}) on a Rosenbrock function (Upper-Left); DCSGD (Theorem \ref{thm:DCSGD_Theorem}) with Rand-$k$ on an Embedded Saddle (Upper-Right); DSignSGD on a Convex Quadratic (Bottom-Left), where, as per Theorem \ref{thm:DSignSGD_Theorem}, the dynamics can be partitioned into three phases: not only is the "Full" SDE a faithful model for DSignSGD throughout the dynamics, but so are the ODE of Phase 1 and the SDE of Phase 3 in their respective phases, and the bound that characterizes Phase 2 captures the dynamics as prescribed; DCSGD on an MLP, where the SDEs and the optimizers move at the same speed (Bottom-Right). For details, see Appendix \ref{app:Exper}.
  • Figure 2: Validation of Scaling Rules: Consistent with Prop. \ref{prop:DCSGD_RecoverLaws_Main}, DCSGD runs whose hyperparameters follow the scaling rules listed in Table \ref{tab:DCSGD_ScalingLaws} (marked in green in the legends) recover the performance of DSGD$(\eta, B, N)$. Those that do not (marked in red) fail to do so. On the left, we plot the training loss of a ViT for some rules, while on the right we do the same for a ResNet. Details are in Appendix \ref{app:Exper}.
  • Figure 3: Validation of Bounds: As prescribed by Theorem \ref{thm:DSignSGD_Convergence}, the bounds match or dominate the empirical loss of DSignSGD on a convex quadratic function in all three phases (Left); as per Theorem \ref{thm:DSignSGD_Convergence}, DSignSGD achieves linear speedup: more agents imply lower loss (Right).
  • Figure 4: Validation of Scaling Rules: Consistent with Proposition \ref{prop:DSignSGD_ScalingLaws}, DSignSGD runs whose hyperparameters follow our scaling rule (in green in the legends) recover the performance of DSignSGD$(\eta, B, N)$. The one that does not (in red) fails to do so. On the left, we plot the training loss of a ViT for some rules, while on the right we do the same for a ResNet. Details are in Appendix \ref{app:Exper}.
  • Figure 5: Empirical validation of the insights derived from Theorem \ref{thm:DCSGD_Convergence} and Theorem \ref{thm:DSignSGD_Convergence}: i) DCSGD cannot handle fat-tailed noise: the loss diverges if $\nu=1$ and is non-stationary if $\nu=2$ (Upper-Left); ii) the loss of DCSGD diverges faster as the noise grows larger (Upper-Right); iii) DSignSGD converges even when the noise is fat-tailed, although fatter tails imply a less optimal final loss (Bottom-Left); iv) DSignSGD never diverges, even as the noise becomes increasingly large (Bottom-Right).
  • ...and 8 more figures

Theorems & Definitions (84)

  • Definition 3.1: Weak Approximation
  • Theorem 3.2: Informal Statement of Theorem \ref{thm:DSGD_SDE}
  • Theorem 3.3
  • Theorem 3.4
  • Proposition 3.5
  • Theorem 3.6: Informal Statement of Theorem \ref{thm:DCSGD_SDE}
  • Theorem 3.7
  • Theorem 3.8
  • Proposition 3.9
  • Theorem 3.10: Informal Statement of Theorem \ref{thm:DSignSGD_SDE}
  • ...and 74 more