Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training

Qinglun Li, Yingqi Liu, Miao Zhang, Xiaochun Cao, Quanjun Yin, Li Shen

TL;DR

This work analyzes decentralized stochastic gradient descent with multiple gossip steps (DSGD-MGS) through a stability-based generalization lens, revealing that MGS exponentially reduces the optimization error and tightens the generalization bound, yet a fundamental gap to centralized mini-batch SGD remains as the number of gossip steps grows. It introduces $l_2$ on-average model stability to derive generalization and excess-error bounds in non-convex settings without assuming bounded gradients, and provides a unified framework that characterizes how learning rate, data heterogeneity, node count, per-node sample size, and topology influence DSGD-MGS generalization. The authors further demonstrate, both theoretically and empirically on CIFAR datasets, that increasing gossip steps $Q$ yields exponential improvements in the bounds, but the centralized performance limit cannot be reached by MGS alone. The work also extends the analysis to consensus errors and mini-batch scenarios, offering actionable guidelines for hyperparameter tuning in decentralized training and advancing the theoretical understanding of how communication topology and data distribution shape generalization.
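The gossip-averaging idea behind MGS can be illustrated with a small sketch. This is not the authors' implementation: it assumes a ring topology in which each node averages itself and its two neighbours with weight 1/3, uses scalar per-node parameters, and omits the local SGD step entirely. The helper names `ring_gossip_matrix`, `gossip`, and `consensus_error` are hypothetical.

```python
# Minimal sketch, assuming a ring topology with uniform 1/3 weights.
# Not the paper's code: only the repeated neighbour averaging is shown.

def ring_gossip_matrix(m):
    """Build a symmetric, doubly stochastic mixing matrix for a ring of m nodes."""
    W = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in (i - 1, i, i + 1):
            W[i][j % m] += 1.0 / 3.0  # self plus left and right neighbour
    return W


def gossip(W, x, Q):
    """Apply Q rounds of neighbour averaging to the per-node values x."""
    m = len(x)
    for _ in range(Q):
        x = [sum(W[i][j] * x[j] for j in range(m)) for i in range(m)]
    return x


def consensus_error(x):
    """Mean squared distance of the node values from their average."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


if __name__ == "__main__":
    m = 50  # assumed node count, chosen to mirror the 50-node setting of Figure 1
    W = ring_gossip_matrix(m)
    x0 = [float(i) for i in range(m)]  # deliberately heterogeneous start
    for Q in (1, 5, 20, 100):
        print(Q, consensus_error(gossip(W, x0, Q)))
```

Because the mixing matrix is doubly stochastic, every round preserves the network average while shrinking the disagreement between nodes geometrically, at a rate governed by the matrix's second-largest eigenvalue magnitude. On a large ring that eigenvalue is close to 1, so many gossip steps are needed before the nodes agree closely.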

Abstract

Decentralized training removes the centralized server, making it a communication-efficient approach that can significantly improve training efficiency, but it often suffers from degraded performance compared to centralized training. Multi-Gossip Steps (MGS) serve as a simple yet effective bridge between decentralized and centralized training, significantly reducing the empirical performance gap. However, the theoretical reasons for its effectiveness, and whether this gap can be fully eliminated by MGS, remain open questions. In this paper, we derive upper bounds on the generalization error and excess error of MGS using stability analysis, systematically answering these two key questions. (1) Optimization Error Reduction: MGS reduces the optimization error bound at an exponential rate, thereby exponentially tightening the generalization error bound and enabling convergence to better solutions. (2) Gap to Centralization: Even as the number of gossip steps approaches infinity, a non-negligible gap in generalization error remains compared to centralized mini-batch SGD ($\mathcal{O}(T^{\frac{c\beta}{c\beta+1}}/(n m))$ in the centralized case and $\mathcal{O}(T^{\frac{2c\beta}{2c\beta+2}}/(n m^{\frac{1}{2c\beta+2}}))$ in the decentralized case). Furthermore, we provide the first unified analysis of how factors like learning rate, data heterogeneity, node count, per-node sample size, and communication topology impact the generalization of MGS under non-convex settings without the bounded-gradient assumption, filling a critical theoretical gap in decentralized training. Finally, experiments on CIFAR datasets support our theoretical findings.
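As a reading aid, the two rates quoted in the abstract can be compared directly. This takes the stated bounds at face value, ignores hidden constants, and assumes the denominators are $nm$ and $n\,m^{\frac{1}{2c\beta+2}}$ respectively; it is not an additional result from the paper. Since $\frac{2c\beta}{2c\beta+2} = \frac{c\beta}{c\beta+1}$, the dependence on $T$ is identical in both bounds, and their ratio depends only on $m$:

$$\frac{T^{\frac{2c\beta}{2c\beta+2}} \big/ \left(n\, m^{\frac{1}{2c\beta+2}}\right)}{T^{\frac{c\beta}{c\beta+1}} \big/ (n m)} = m^{\,1-\frac{1}{2c\beta+2}} = m^{\frac{2c\beta+1}{2c\beta+2}}.$$

Under this reading, the decentralized bound is looser by a factor that grows with $m$ and does not depend on the number of gossip steps, consistent with the claim that the gap persists in the limit of infinitely many gossip steps.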

Paper Structure

This paper contains 33 sections, 11 theorems, 88 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

Let $A$ be $l_2$ on-average model $\varepsilon$-stable. Let $\gamma > 0$. Then, if $\ell(\cdot;z)$ is nonnegative and $\beta$-smooth for all $z\in\mathcal{Z}$, we have

Figures (7)

  • Figure 1: Under ring topology, DSGD-MGS with 20 gossip steps still shows significant performance gaps versus Mini-batch SGD in both training loss and test accuracy (LeNet on CIFAR-10, Dir 0.3, 50 nodes).
  • Figure 2: A comparison of the $l_2$ weight distance and loss distance (i.e., test loss minus train loss) for the DSGD-MGS algorithm on the CIFAR-10 dataset.
  • Figure 3: The test loss of the DSGD-MGS algorithm on the CIFAR-10 test dataset.
  • Figure 4: A comparison of the $l_2$ weight distance and loss distance (i.e., test loss minus train loss) for the DSGD-MGS algorithm on the CIFAR-10 dataset with centralized methods.
  • Figure 5: The accuracy of the DSGD-MGS algorithm on the CIFAR-10 test dataset.
  • ...and 2 more figures

Theorems & Definitions (26)

  • Definition 1: $l_2$ on-average model stability
  • Lemma 1: Generalization via on-average model stability [lei2020fine]
  • Definition 2: Gossip Matrix
  • Remark 1
  • Lemma 2
  • Theorem 1: Stability for the DSGD-MGS
  • Theorem 2: Optimization error of DSGD-MGS
  • Theorem 3: Generalization error of DSGD-MGS
  • Remark 2: Optimization Error Reduction
  • Remark 3: Gap to Centralization
  • ...and 16 more