Compressed and Sparse Models for Non-Convex Decentralized Learning

Andrew Campbell; Hang Liu; Leah Woldemariam; Anna Scaglione

TL;DR

Malcom-PSGD tackles high communication costs in decentralized non-convex learning by combining $\ell_1$-induced sparsity with gradient compression via dithering-based quantization and vector source coding within a proximal SGD framework. It proves convergence for non-convex, compressed decentralized proximal SGD with constant step sizes, achieving a rate of $\mathcal{O}(1/\sqrt{t})$ and a consensus rate of $\mathcal{O}(1/t)$. The paper also provides a closed-form bit-rate analysis showing the encoding yields $R(L)=\mathcal{O}(d e^{-1/L})$ and demonstrates about a 75% reduction in communication bits in experiments. Empirical results on MNIST and CIFAR-10 with ring and fully-connected topologies validate the theoretical findings and illustrate significant improvements in communication efficiency without sacrificing accuracy.

Abstract

Recent research highlights frequent model communication as a significant bottleneck to the efficiency of decentralized machine learning (ML), especially for large-scale and over-parameterized neural networks (NNs). To address this, we present Malcom-PSGD, a novel decentralized ML algorithm that combines gradient compression techniques with model sparsification. We promote model sparsity by adding $\ell_1$ regularization to the objective and present a decentralized proximal SGD method for training. Our approach employs vector source coding and dithering-based quantization for the compressed gradient communication of sparsified models. Our analysis demonstrates that Malcom-PSGD achieves a convergence rate of $\mathcal{O}(1/\sqrt{t})$ with respect to the number of iterations $t$, assuming constant consensus and learning rates. This result is supported by our proof of convergence for non-convex compressed proximal SGD methods. Additionally, we conduct a bit analysis, providing a closed-form expression for the communication costs associated with Malcom-PSGD. Numerical results verify our theoretical findings and demonstrate that our method reduces communication costs by approximately $75\%$ compared to the state of the art.
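As a rough illustration of the $\ell_1$-regularized proximal step described in the abstract, the sketch below applies one single-node proximal SGD update: a gradient step on the smooth loss followed by soft-thresholding, which is the proximal operator of the $\ell_1$ norm. The function names, the step size `lr`, and the regularization weight `lam` are illustrative assumptions, and the consensus and compression steps of Malcom-PSGD are deliberately omitted.

```python
def soft_threshold(x, thresh):
    """Proximal operator of thresh * ||x||_1, applied elementwise."""
    out = []
    for v in x:
        if v > thresh:
            out.append(v - thresh)
        elif v < -thresh:
            out.append(v + thresh)
        else:
            out.append(0.0)  # entries within [-thresh, thresh] become exactly zero
    return out


def prox_sgd_step(x, grad, lr, lam):
    """Gradient step on the smooth loss, then the prox of lr * lam * ||.||_1.

    Single-node sketch only (assumption): no neighbors, no quantization.
    """
    stepped = [xi - lr * gi for xi, gi in zip(x, grad)]
    return soft_threshold(stepped, lr * lam)


# Hypothetical parameters and stochastic gradient, chosen for illustration.
x = [0.5, -0.02, 0.01, -1.0]
g = [0.1, 0.0, 0.0, -0.2]
x_new = prox_sgd_step(x, g, lr=0.1, lam=0.5)
print(x_new)
```

The small entries fall inside the threshold `lr * lam` and are set exactly to zero, which is what makes the resulting model sparse and therefore cheaper to encode.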

Paper Structure (22 sections, 9 theorems, 61 equations, 3 figures, 1 table, 3 algorithms)

Introduction
Learning Sparse Models
Proposed Decentralized Algorithm
Convergence Analysis
Communication Bit Rate Analysis
Numerical Results
Conclusion
Appendix
The Quantization Scheme in (\ref{eq:q(x)}) Satisfies Lemma \ref{lem:quant}
Matrix Representation of Malcom-PSGD and Useful Lemmas
Further Discussions on the Encoding Algorithm
Proof of Theorem \ref{th:bits}
Proof of Theorem \ref{th:converge}
Lemma and Proposition Proofs
Proof of Lemma \ref{th:consensus}
...and 7 more sections

Key Result

Lemma 1

For any input ${\bm{p}}$, the quantized result received after de-normalization, $Q({\bm{p}}):\mathbb{R}^d\rightarrow\mathbb{R}^{d}$, satisfies a compression error bound, where $\tau\in (0,1)$ is a constant measuring that bound. Specifically, we have $\tau=1+d/L^2$ for the quantizer in (\ref{eq:q(x)}).
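To make the role of the level count $L$ and the dimension $d$ more concrete, the following sketch implements a generic unbiased stochastic quantizer: normalize by the $\ell_2$ norm, round randomly onto a grid of $L$ intervals, then de-normalize. This is an assumed stand-in, not the paper's quantizer in (eq:q(x)), which is not reproduced here and may differ in its normalization and dithering. For this stand-in, each coordinate's error is below $\|{\bm{p}}\|/L$, so the squared error never exceeds $(d/L^2)\|{\bm{p}}\|^2$.

```python
import math
import random


def stochastic_quantize(p, levels, rng):
    """Unbiased randomized rounding of |p_i| / ||p||_2 onto {0, 1/levels, ..., 1}.

    Generic sketch (assumption): not the exact scheme analyzed in the paper.
    """
    norm = math.sqrt(sum(v * v for v in p))
    if norm == 0.0:
        return list(p)
    out = []
    for v in p:
        scaled = abs(v) / norm * levels          # lies in [0, levels]
        low = math.floor(scaled)
        # Round up with probability equal to the fractional part (unbiased).
        q = low + (1 if rng.random() < scaled - low else 0)
        out.append(math.copysign(q * norm / levels, v))
    return out


rng = random.Random(0)
d, L = 64, 16                                    # hypothetical dimension and levels
p = [rng.gauss(0.0, 1.0) for _ in range(d)]
norm_sq = sum(v * v for v in p)

trials = 2000
mean_q = [0.0] * d
worst_rel_err = 0.0
for _ in range(trials):
    q = stochastic_quantize(p, L, rng)
    err = sum((a - b) ** 2 for a, b in zip(q, p))
    worst_rel_err = max(worst_rel_err, err / norm_sq)
    mean_q = [m + a / trials for m, a in zip(mean_q, q)]

max_bias = max(abs(m - v) for m, v in zip(mean_q, p))
print(worst_rel_err, d / L ** 2, max_bias)
```

Increasing `L` shrinks the error term $d/L^2$ at the cost of more bits per coordinate, which is the trade-off the bit-rate analysis quantifies.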

Figures (3)

Figure 2: All plots are over the ring topology. The top row corresponds to the ResNet setup on the CIFAR10 dataset, while the bottom row corresponds to the 3-layer FC NN on MNIST. The left column contains the accuracy plots and the right column contains the loss plots. The horizontal black dashed line marks the accuracy cut-off, and the colored diamond markers indicate where the bits were sampled. The CIFAR10 setup has an accuracy cut-off of 67.0, while the MNIST setup has a cut-off of 67.5.
Figure 3: All plots are over the fully connected topology. The top row corresponds to the ResNet setup on the CIFAR10 dataset, while the bottom row corresponds to the 3-layer FC NN on MNIST. The left column contains the accuracy plots and the right column contains the loss plots. The horizontal black dashed line marks the accuracy cut-off, and the colored diamond markers indicate where the bits were sampled. The CIFAR10 setup has an accuracy cut-off of 70.6, while the MNIST setup has a cut-off of 95.3.
Figure 4: Left: The Ring-Like network topology. Circles denote the devices and edges denote connection links, where self-loops are omitted in the plot for brevity. Right: The corresponding mixing matrix $\bf W$.
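The relationship between a ring topology and its mixing matrix can be sketched as follows. The code builds a symmetric, doubly stochastic matrix for a plain ring with self-loops and uniform weights of 1/3, then runs a few averaging rounds. The uniform weights, the node count, and the function names are assumptions for illustration; the paper's "ring-like" topology and its actual $\bf W$ may differ.

```python
def ring_mixing_matrix(n):
    """Mixing matrix for a ring of n >= 3 nodes, each with a self-loop.

    Every node weights itself and its two neighbors by 1/3 (assumed weights),
    so rows and columns each sum to one.
    """
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] = 1.0 / 3.0
    return W


def gossip_round(W, values):
    """One consensus step: each node replaces its value by a weighted neighbor average."""
    n = len(values)
    return [sum(W[i][j] * values[j] for j in range(n)) for i in range(n)]


W = ring_mixing_matrix(5)
values = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical local scalars, one per device
for _ in range(50):
    values = gossip_round(W, values)
print(values)
```

Because the matrix is doubly stochastic, the network-wide average is preserved at every round, and repeated mixing drives all nodes toward that average.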

Theorems & Definitions (14)

Lemma 1
Theorem 1
Theorem 2
Lemma 2
Lemma 3
proof
proof
Lemma 4
Lemma 5
Theorem 3
...and 4 more
