Towards Faster Decentralized Stochastic Optimization with Communication Compression

Rustem Islamov; Yuan Gao; Sebastian U. Stich

TL;DR

This paper introduces MoTEF, a method that integrates communication compression with Momentum Tracking and Error Feedback. The analysis shows that MoTEF achieves most of the desired properties and significantly outperforms existing methods under arbitrary data heterogeneity.

Abstract

Communication efficiency has garnered significant attention as it is considered the main bottleneck for large-scale decentralized Machine Learning applications in distributed and federated settings. In this regime, clients are restricted to transmitting small amounts of quantized information to their neighbors over a communication graph. Numerous endeavors have been made to address this challenging problem by developing algorithms with compressed communication for decentralized non-convex optimization problems. Despite considerable efforts, the current results suffer from various issues such as non-scalability with the number of clients, requirements for large batches, or a bounded gradient assumption. In this paper, we introduce MoTEF, a novel approach that integrates communication compression with Momentum Tracking and Error Feedback. Our analysis demonstrates that MoTEF achieves most of the desired properties and significantly outperforms existing methods under arbitrary data heterogeneity. We provide numerical experiments to validate our theoretical findings and confirm the practical superiority of MoTEF.

Paper Structure (38 sections, 23 theorems, 120 equations, 10 figures, 1 table, 2 algorithms)

Introduction
Related works
Decentralized optimization and gradient tracking.
Momentum in distributed training.
Short history of Error Feedback.
Issues of Error Feedback in decentralized setting.
Problem setup
The Algorithms and Theoretical analysis
Notation
Convergence of MoTEF
Convergence of MoTEF-VR
Numerical experiments
Synthetic least squares problem
Increasing the number of nodes.
Effect of the momentum parameter.
...and 23 more sections

Key Result

Lemma 1

Let the smoothness and bounded variance assumptions hold. Then there exist absolute constants $c_{\gamma}, c_{\lambda}, c_{\eta}$, and $\tau \le 1$ such that if we set the stepsizes $\gamma = c_{\gamma}\alpha\rho$, $\lambda = c_{\lambda}\alpha\rho^3\tau$, $\eta = c_{\eta}L^{-1}\alpha\rho^3\tau$ such that …

Figures (10)

Figure 1: (a) BEER with different numbers of clients $n$; (b) MoTEF with different numbers of clients $n$; (c) MoTEF with different values of the momentum parameter $\lambda$. MoTEF's error decreases as the number of clients increases, while the error of BEER does not. The error of MoTEF increases as the momentum parameter increases. In all cases, we set $d=20,\zeta=10,\sigma=10$, and apply the Top-K compressor with $\alpha=K/d=0.1$. We fix the parameters $\gamma=0.1,\eta=0.0005,\lambda=0.005$, and $n=16$ unless stated otherwise. (d) The number of iterations for MoTEF to reach an error of $10^{-3}$, compared to the theoretical prediction $\mathcal{O}(1/\rho^3)$. We see that the convergence of MoTEF is much less sensitive to $\rho$ than the theoretical prediction.
Figure 2: Performance of MoTEF, BEER and CHOCO-SGD with varying data heterogeneity $\zeta$ and fixed noise level $\sigma=5$. We see that MoTEF outperforms BEER and CHOCO-SGD in all cases and is not affected by the data heterogeneity, while CHOCO-SGD's performance degrades as $\zeta$ increases. We set $d=20,n=4$ and apply the Top-K compressor with $\alpha=K/d=0.1$. We set the target error to $0.01$.
Figure 3: Comparison of MoTEF, BEER, Choco-SGD, DSGD, D2 in terms of communication complexity on logistic regression with non-convex regularization on a ring topology with batch size $5$ and the gsgd$_b$ compressor. We observe that MoTEF outperforms the other algorithms in terms of both test accuracy and gradient norm.
Figure 4: Performance of MoTEF under changing network topology, tested on logistic regression with non-convex regularization. We set $n=40$, $\lambda=0.05$, and batch size $100$. We observe that MoTEF is very robust to changes in the network topology for practical problems.
Figure 5: Comparison of MoTEF, BEER, Choco-SGD, DSGD, D2 in terms of communication complexity on training an MLP with $1$ hidden layer. We observe that MoTEF outperforms the other methods.
...and 5 more figures
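
The captions for Figures 1 and 2 apply a Top-K compressor with $\alpha=K/d=0.1$. The following is a minimal NumPy sketch of such a compressor, not the authors' implementation: the function name `top_k` and the random test vector are assumptions made for illustration. It keeps the $K$ largest-magnitude coordinates and checks the standard contraction property $\|C(x)-x\|^2 \le (1-\alpha)\|x\|^2$ for $d=20$, $K=2$.

```python
import numpy as np


def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of a 1-D vector, zero the rest."""
    if x.ndim != 1:
        raise ValueError("x must be a 1-D array")
    if not 0 < k <= x.size:
        raise ValueError("k must satisfy 0 < k <= x.size")
    out = np.zeros_like(x)
    # argpartition puts the indices of the k largest |x_i| in the last k slots
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out


# d = 20 and alpha = K/d = 0.1 as in the captions, hence K = 2
d, k = 20, 2
alpha = k / d

rng = np.random.default_rng(0)      # illustrative input, not data from the paper
x = rng.standard_normal(d)

cx = top_k(x, k)
err = float(np.sum((cx - x) ** 2))              # squared compression error
bound = float((1 - alpha) * np.sum(x ** 2))     # contraction bound

print("nonzeros sent:", np.count_nonzero(cx), "of", d)
print("error within bound:", err <= bound)
```

Only the $K$ retained values and their indices need to be transmitted, which is where the communication saving comes from; the price is the compression error that methods such as Error Feedback are designed to correct.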

Theorems & Definitions (45)

Definition 1
Lemma 1: Descent of the Lyapunov function
Theorem 1: Convergence of MoTEF
Remark 2
Theorem 2: Convergence of MoTEF
Remark 3
Theorem 3: Convergence of MoTEF-VR
Remark 4
Lemma 5: Lemma B.2 from zhao2022beer
Lemma 6
...and 35 more
