DIGEST: Fast and Communication Efficient Decentralized Learning with Local Updates

Peyman Gholami; Hulya Seferoglu

TL;DR

The paper tackles the high communication costs and potential slow convergence of decentralized learning by introducing DIGEST, an asynchronous framework that fuses Gossip-like information spreading with occasional global-model exchanges driven by local-SGD. It supports both single-stream and multi-stream modes, enabling a tunable trade-off between convergence speed and communication overhead, and provides convergence guarantees for iid and non-iid data across network topologies. Empirical results on logistic regression and ResNet-20 demonstrate that multi-stream DIGEST often outperforms baselines in non-iid settings while maintaining competitive performance in iid scenarios, with clear speed-up gains. Overall, DIGEST offers a practical, topology-agnostic approach to scalable decentralized learning without a central server.

Abstract

Two widely considered decentralized learning algorithms are Gossip and random-walk-based learning. Gossip algorithms (both synchronous and asynchronous versions) suffer from high communication cost, while random-walk-based learning suffers from increased convergence time. In this paper, we design DIGEST, a fast and communication-efficient asynchronous decentralized learning mechanism that takes advantage of both Gossip and random-walk ideas and focuses on stochastic gradient descent (SGD). DIGEST is an asynchronous decentralized algorithm building on local-SGD algorithms, which were originally designed for communication-efficient centralized learning. We design both single-stream and multi-stream DIGEST; the communication overhead may increase with the number of streams, which creates a trade-off between convergence and communication overhead that can be leveraged. We analyze the convergence of single- and multi-stream DIGEST and prove that both algorithms approach the optimal solution asymptotically for both iid and non-iid data distributions. We evaluate the performance of single- and multi-stream DIGEST for logistic regression and a deep neural network, ResNet20. The simulation results confirm that multi-stream DIGEST has favorable convergence properties: its convergence time is better than or comparable to the baselines in the iid setting, and it outperforms the baselines in the non-iid setting.
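The abstract notes that DIGEST builds on local-SGD, in which each node takes several SGD steps on its own data between model exchanges. The sketch below illustrates only that generic building block on a toy scalar least-squares problem. It is not the DIGEST protocol: it is synchronous, it has no Gossip or random-walk component, and the function name `local_sgd`, its parameters, and the data-size-weighted averaging step are assumptions made for illustration.

```python
import random


def local_sgd(datasets, rounds=50, local_steps=5, lr=0.05, seed=0):
    """Generic local-SGD on a scalar least-squares problem (not DIGEST itself).

    datasets: one list of (a, b) samples per node; every node minimizes the
    mean of 0.5 * (a * x - b) ** 2 over its own samples.
    """
    rng = random.Random(seed)
    total = sum(len(d) for d in datasets)
    x_global = 0.0
    for _ in range(rounds):
        local_models = []
        for data in datasets:
            x = x_global  # each node starts from the last shared model
            for _ in range(local_steps):
                a, b = rng.choice(data)      # draw one stochastic sample
                x -= lr * a * (a * x - b)    # SGD step on 0.5 * (a*x - b)^2
            local_models.append(x)
        # Hypothetical aggregation rule: weight each node by its data share.
        x_global = sum(
            len(d) / total * x for d, x in zip(datasets, local_models)
        )
    return x_global


# Toy data in which every node's samples satisfy b = 3 * a.
toy = [[(1.0, 3.0), (2.0, 6.0)], [(1.0, 3.0)], [(2.0, 6.0), (1.0, 3.0), (2.0, 6.0)]]
x_hat = local_sgd(toy)
```

Raising `local_steps` reduces how often models are exchanged, which is the communication saving local-SGD aims for. How DIGEST schedules exchanges asynchronously over a network is described in the paper and is not reproduced here.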

Paper Structure (18 sections, 12 theorems, 67 equations, 6 figures, 1 table, 3 algorithms)

Introduction
Related Work
Design of DIGEST
Preliminaries
Single-Stream DIGEST
Overview
Algorithm Design
Multi-Stream DIGEST
Tree Construction and Multiple Streams
Algorithm Design
Convergence Analysis of DIGEST
Evaluation of DIGEST
Logistic Regression
Deep Neural Network (DNN)
Speed-up
...and 3 more sections

Key Result

Theorem 4.1

Let Assumptions as1-as6 hold. With a constant and small enough learning rate $\eta \leq \frac{1}{30LA}$ (potentially depending on $T$), the convergence rate of single- and multi-stream DIGEST is as follows. Non-convex: $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\| \nabla f(\hat{\ldots}$ ..., where $\hat{\mathbf{x}}_T = \sum_{v=1}^{V} \sum_{t=0}^{T-1} \frac{D_{v}}{D}\mathbf{x}^v_{t}$. Conve...
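Two quantities in the theorem statement are easy to compute directly: the step-size cap $\eta \leq \frac{1}{30LA}$ and the $\frac{D_v}{D}$ weighting of the local models. The minimal sketch below treats $L$ and $A$ as given positive constants, since this excerpt does not define them, and evaluates the weighting for a single iteration only. The helper names `max_learning_rate` and `weighted_model` are hypothetical and do not come from the paper.

```python
def max_learning_rate(L, A):
    """Largest step size allowed by eta <= 1 / (30 * L * A)."""
    if L <= 0 or A <= 0:
        raise ValueError("L and A must be positive")
    return 1.0 / (30.0 * L * A)


def weighted_model(local_models, data_sizes):
    """Return sum_v (D_v / D) * x^v for equal-length model vectors.

    local_models: list of per-node parameter vectors (lists of floats).
    data_sizes:   list of per-node sample counts D_v, with D = sum(D_v).
    """
    if len(local_models) != len(data_sizes):
        raise ValueError("need one data size per local model")
    D = sum(data_sizes)
    dim = len(local_models[0])
    return [
        sum(d / D * x[i] for x, d in zip(local_models, data_sizes))
        for i in range(dim)
    ]


# Two nodes holding 1 and 3 samples: the second model gets weight 3/4.
avg = weighted_model([[1.0, 0.0], [3.0, 4.0]], [1, 3])
eta_cap = max_learning_rate(L=1.0, A=1.0)
```

Nodes with more data pull the aggregate toward their own model, which is why the weighting matters in non-iid settings where the local optima differ.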

Figures (6)

Figure 1: DIGEST in perspective as compared to existing decentralized learning algorithms; (a) synchronous Gossip, asynchronous Gossip, and random-walk. "$\nabla$" represents a model update. "Xmit" represents the transmission of a model from a node to one of its neighbors. "Recv" represents the communication duration while receiving model updates from all of a node's neighbors. "A" represents model aggregation. $\mathbf{x}^v_{t}$ denotes the local model of node $v$ at iteration $t$. For the random-walk algorithm, the global model iterates are denoted as $\mathbf{x}_t$. The absence of blue boxes in any figure means that nodes do not continue their computations, and the absence of red boxes means that there is no communication among neighboring nodes. Communication ("Xmit") and computation ("$\nabla$") run in parallel in DIGEST and asynchronous Gossip, whereas aggregation ("A") and computation are sequential. The figure shows them as parallel tasks for ease of presentation, since the duration of aggregation ("A") is negligible compared to communication ("Xmit") and computation ("$\nabla$").
Figure 3: Example multi-stream DIGEST.
Figure 4: Convergence results for MNIST dataset in terms of global loss over wall-clock time.
Figure 5: Convergence results for w8a dataset in terms of global loss over wall-clock time.
Figure 6: Convergence results for CIFAR-10 dataset in terms of global loss over wall-clock time.
...and 1 more figure

Theorems & Definitions (17)

Theorem 4.1
Corollary 4.1.1
Corollary 4.1.2
Corollary 4.1.3
Lemma 4.2
Lemma 4.3
Lemma 4.4: Bounding deviation
Lemma 1
proof
Lemma 2
...and 7 more
