Improved Stability and Generalization Guarantees of the Decentralized SGD Algorithm

Batiste Le Bars, Aurélien Bellet, Marc Tommasi, Kevin Scaman, Giovanni Neglia

TL;DR

For convex, strongly convex and non-convex functions, the paper shows that D-SGD can always recover generalization bounds analogous to those of classical SGD, which suggests that, in the worst case, the choice of graph does not matter. A refined, optimization-dependent bound for general convex functions further shows that a poorly-connected graph can even be beneficial for generalization.

Abstract

This paper presents a new generalization error analysis for Decentralized Stochastic Gradient Descent (D-SGD) based on algorithmic stability. The obtained results overhaul a series of recent works that suggested an increased instability due to decentralization and a detrimental impact of poorly-connected communication graphs on generalization. On the contrary, we show, for convex, strongly convex and non-convex functions, that D-SGD can always recover generalization bounds analogous to those of classical SGD, suggesting that the choice of graph does not matter. We then argue that this result stems from a worst-case analysis, and we provide a refined optimization-dependent generalization bound for general convex functions. This new bound reveals that the choice of graph can in fact improve the worst-case bound in certain regimes, and that, surprisingly, a poorly-connected graph can even be beneficial for generalization.
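
For readers unfamiliar with D-SGD, a minimal sketch of one common variant follows: each agent takes a local stochastic gradient step, then averages with its neighbours through a doubly stochastic mixing matrix $W$ that encodes the communication graph. The least-squares loss, the data layout, and the helper names (`ring_mixing_matrix`, `complete_mixing_matrix`, `d_sgd`) are assumptions made for illustration and are not taken from the paper, whose exact update rule may differ.

```python
import numpy as np


def ring_mixing_matrix(m):
    """Hypothetical helper: doubly stochastic matrix for a ring graph.

    Each agent gives weight 1/3 to itself and to each of its two neighbours.
    """
    W = np.zeros((m, m))
    for k in range(m):
        for j in (k - 1, k, k + 1):
            W[k, j % m] += 1.0 / 3.0
    return W


def complete_mixing_matrix(m):
    """Hypothetical helper: uniform averaging over all m agents."""
    return np.full((m, m), 1.0 / m)


def d_sgd(X, y, W, eta=0.03, T=200, seed=0):
    """Sketch of D-SGD on a least-squares loss (an illustrative choice).

    X has shape (m, n, d): m agents, n local samples each, d features.
    y has shape (m, n). Returns the (m, d) array of local models.
    """
    rng = np.random.default_rng(seed)
    m, n, d = X.shape
    theta = np.zeros((m, d))
    agents = np.arange(m)
    for _ in range(T):
        # Each agent draws one sample uniformly from its own local dataset.
        idx = rng.integers(n, size=m)
        x_t = X[agents, idx]                      # shape (m, d)
        y_t = y[agents, idx]                      # shape (m,)
        # Gradient of 0.5 * (x . theta - y)^2 at each agent's current model.
        residual = np.einsum("kd,kd->k", x_t, theta) - y_t
        grads = residual[:, None] * x_t
        # Local gradient step followed by gossip averaging with neighbours.
        theta = W @ (theta - eta * grads)
    return theta
```

Swapping `ring_mixing_matrix(m)` for `complete_mixing_matrix(m)` changes only how quickly the local models are pulled toward their average, which is the role the communication graph plays in the analysis summarized above.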

Paper Structure

This paper contains 27 sections, 15 theorems, 80 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Lemma 2.2

(Generalization via on-average model stability; Lei and Ying, 2020). Let $A$ be on-average model $\varepsilon$-stable. Then, if $\ell(\cdot;z)$ is $L$-Lipschitz for all $z\in{\cal Z}$ (see the Lipschitz assumption), we have $|\mathbb{E}_{A,S}[R(A(S)) - R_S(A(S))]| \leq L \varepsilon$.
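
One way to see where the factor $L$ comes from is the standard symmetrization argument, sketched here in single-dataset notation. The symbols $S=(z_1,\dots,z_n)$ and $S^{(i)}$ (the copy of $S$ in which $z_i$ is replaced by an independent draw $z_i'$) are assumptions made for illustration, as is the reading of on-average model $\varepsilon$-stability as $\mathbb{E}\big[\frac{1}{n}\sum_{i}\|A(S^{(i)})-A(S)\|\big]\le\varepsilon$; the paper's Definition 2.1 for the decentralized setting may be stated differently.

```latex
\begin{align*}
\mathbb{E}_{A,S}\big[R(A(S)) - R_S(A(S))\big]
  &= \frac{1}{n}\sum_{i=1}^{n}
     \mathbb{E}\big[\ell(A(S^{(i)}); z_i) - \ell(A(S); z_i)\big]
  && \text{($z_i$ is independent of $S^{(i)}$, and $S^{(i)} \sim S$)} \\
\Big|\mathbb{E}_{A,S}\big[R(A(S)) - R_S(A(S))\big]\Big|
  &\le \frac{1}{n}\sum_{i=1}^{n}
     \mathbb{E}\big[L\,\|A(S^{(i)}) - A(S)\|\big]
  && \text{($\ell(\cdot;z)$ is $L$-Lipschitz)} \\
  &\le L\,\varepsilon
  && \text{(on-average model $\varepsilon$-stability)}
\end{align*}
```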

Figures (1)

  • Figure 1: Empirical generalization error as a function of the number of iterations $T$, for different communication graphs, with constant stepsize $\eta=0.03$. (Left) Low-noise regime with $\sigma\simeq 0$. (Right) Noisy regime with $\sigma > 0$. See the experiments appendix for experimental details.

Theorems & Definitions (32)

  • Definition 2.1
  • Lemma 2.2
  • Remark 2.3
  • Remark 2.6
  • Theorem 3.1
  • Proof: Sketch of proof (see the appendix on the convex case for details)
  • Remark 3.2
  • Theorem 3.3
  • Theorem 4.1
  • Proof: Sketch of proof (see the appendix on the non-convex case for details)
  • ...and 22 more