Get More for Less in Decentralized Learning Systems

Akash Dhasade; Anne-Marie Kermarrec; Rafael Pires; Rishi Sharma; Milos Vujasinovic; Jeffrey Wigger

TL;DR

This paper tackles the high communication cost of decentralized learning (DL) with large neural networks by introducing JWINS, a wavelet-domain sparsification framework that shares only a subset of parameters under a randomized cut-off. JWINS ranks parameters through a wavelet-based accumulation of model changes, and uses a randomized sharing rate to balance information exchange and network load, complemented by metadata compression. Empirically, JWINS achieves near full-sharing accuracy on non-IID data across multiple tasks while reducing transmitted data by up to $64\%$, and it outperforms CHOCO-SGD by up to $4\times$ in network savings and wall-clock time at low budgets. The results demonstrate JWINS’ scalability to hundreds of nodes, robustness to topology changes, and broad applicability across CNNs, LSTMs, and embeddings, with clear avenues for future theoretical convergence guarantees and adaptive parameter-type ranking.

Abstract

Decentralized learning (DL) systems have been gaining popularity because they avoid raw data sharing by communicating only model parameters, hence preserving data confidentiality. However, the large size of deep neural networks poses a significant challenge for decentralized training, since each node needs to exchange gigabytes of data, overloading the network. In this paper, we address this challenge with JWINS, a communication-efficient and fully decentralized learning system that shares only a subset of parameters through sparsification. JWINS uses a wavelet transform to limit the information loss due to sparsification and a randomized communication cut-off that reduces communication usage without damaging the performance of trained models. We demonstrate empirically with 96 DL nodes on non-IID datasets that JWINS can achieve accuracies similar to full-sharing DL while sending up to 64% fewer bytes. Additionally, on low communication budgets, JWINS outperforms the state-of-the-art communication-efficient DL algorithm CHOCO-SGD by up to 4× in terms of network savings and time.
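The idea in the abstract, sparsifying in the wavelet domain with a randomly chosen sharing fraction, can be illustrated with a minimal sketch. It is not the authors' implementation: the Haar wavelet, the ranking by magnitude of change since a previous model, the candidate fractions, and the names `haar_forward`, `haar_inverse`, `select_coefficients`, and `reconstruct` are all assumptions made for illustration, and the input length must be a power of two.

```python
import math
import random


def haar_forward(x):
    """Multi-level orthonormal Haar transform (length must be a power of two)."""
    coeffs = list(x)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = approx + detail
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / math.sqrt(2))
            merged.append((a - d) / math.sqrt(2))
        out[:2 * n] = merged
        n *= 2
    return out


def select_coefficients(params, prev_params, fractions, rng):
    """Pick a random sharing fraction, then keep the wavelet coefficients
    whose value changed the most since prev_params (assumed ranking rule)."""
    alpha = rng.choice(fractions)  # randomized cut-off (hypothetical candidates)
    coeffs = haar_forward(params)
    prev_coeffs = haar_forward(prev_params)
    change = [abs(c - p) for c, p in zip(coeffs, prev_coeffs)]
    k = max(1, int(alpha * len(coeffs)))
    top = sorted(range(len(coeffs)), key=lambda i: change[i], reverse=True)[:k]
    indices = sorted(top)
    return alpha, indices, [coeffs[i] for i in indices]


def reconstruct(indices, values, length):
    """Rebuild a model from the shared coefficients; missing ones are set to zero
    here (a simplification of what a receiving node might do)."""
    coeffs = [0.0] * length
    for i, v in zip(indices, values):
        coeffs[i] = v
    return haar_inverse(coeffs)


if __name__ == "__main__":
    rng = random.Random(0)
    model = [rng.gauss(0.0, 1.0) for _ in range(16)]
    previous = [0.0] * 16
    alpha, idx, vals = select_coefficients(model, previous, [0.25, 0.5], rng)
    approx = reconstruct(idx, vals, len(model))
    mse = sum((a - b) ** 2 for a, b in zip(model, approx)) / len(model)
    print(f"shared fraction={alpha}, coefficients sent={len(idx)}, MSE={mse:.4f}")
```

Because the Haar transform used here is orthonormal, the squared error in the coefficient domain equals the squared error in the parameter domain, which is why dropping the smallest-change coefficients is a natural way to bound information loss in this sketch.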

Paper Structure

This paper contains 43 sections, 4 equations, 10 figures, and 1 table.

Introduction
Background
Decentralized learning
Objective.
Decentralized training.
Communication compression
Gradient sparsification.
Sparsification and metadata.
Accumulation.
Parameter sparsification.
Random sampling.
JWINS
JWINS parameter ranking
Parameter and gradient representation.
Accumulation in the wavelet domain.
...and 28 more sections

Figures (10)

Figure 1: JWINS consists of four main modules that produce a smaller partial model: (i) wavelet transform and (ii) accumulation give importance scores to parameters; (iii) randomized cut-off enables nodes to randomly choose the fraction of shared parameters; and (iv) metadata compression is used to practically nullify the overhead of metadata when sharing sparsified models.
Figure 2: Mean squared error between the original and reconstructed model when sparsifying parameters using the given algorithms. The plot shows the information loss due to sparsification.
Figure 3: Randomized cut-off in JWINS. The chart on the left depicts the random percentages selected by JWINS nodes in a typical communication round. The chart on the right shows the average sharing percentage across nodes over communication rounds.
Figure 4: Learning curves and network usage for JWINS compared to full-sharing and random sampling when run for a fixed number of rounds. JWINS matches the test accuracy and test loss of full-sharing across most datasets (rows 1 and 2), while requiring significantly fewer network transfers per node (row 3). Results are further quantified in Table 1.
Figure 5: Learning curves and network usage for JWINS compared to full-sharing and random sampling when run until convergence. In this scenario, the random sampling algorithm is first run for a long time to identify a target accuracy. Then JWINS and full-sharing are run until this accuracy is reached. JWINS reaches the target accuracy much faster than random sampling (row 1) while requiring 1.5$\times$ to 4$\times$ less network usage than random sampling (row 2).
...and 5 more figures
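
The Figure 1 caption mentions metadata compression for sparsified models. When only a subset of values is shared, the sender also has to transmit which positions they belong to, and that index list is the metadata. The captions above do not say which compression scheme JWINS uses, so the following is only a generic sketch of one common approach: sort the indices, store the gaps between consecutive indices, and write each gap as a variable-length integer. The function names `encode_indices` and `decode_indices` are hypothetical.

```python
def encode_indices(indices):
    """Delta-encode strictly increasing, non-negative indices as 7-bit varints."""
    out = bytearray()
    prev = -1
    for idx in indices:
        gap = idx - prev  # always >= 1 for strictly increasing input
        prev = idx
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation flag set
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_indices(data):
    """Inverse of encode_indices."""
    indices = []
    prev = -1
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes belong to this gap
        else:
            prev += gap
            indices.append(prev)
            gap = 0
            shift = 0
    return indices


if __name__ == "__main__":
    shared = [0, 3, 200, 201, 5000]
    blob = encode_indices(shared)
    assert decode_indices(blob) == shared
    print(f"{len(blob)} bytes instead of {4 * len(shared)} with fixed 32-bit indices")
```

Small gaps fit in a single byte, so the denser the selection, the cheaper each index becomes relative to a fixed-width encoding.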
