Improved High-probability Convergence Guarantees of Decentralized SGD

Aleksandar Armacki, Ali H. Sayed

TL;DR

The paper addresses high-probability convergence for decentralized SGD without requiring uniformly bounded gradients, under light-tailed noise. By carefully analyzing MGFs of both the gradient-related quantities and the consensus gap, it proves HP convergence for both non-convex and strongly convex costs with order-optimal rates. It demonstrates linear speed-up in the number of users and improved transient times, aligning HP guarantees with known MSE results while avoiding restrictive assumptions. The approach introduces novel tools, including an offset trick and variance-reduction type results for HP analysis, and discusses extensions to time-varying networks and heavier tails.
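As a generic illustration of why bounding an MGF yields a high-probability guarantee, consider the standard Chernoff argument below. The symbols $X_t$ (a nonnegative quantity of interest, such as an optimality gap), $\lambda$, $C_t$ and $\delta$ are illustrative assumptions, not the paper's notation. For any $\lambda > 0$, Markov's inequality applied to $e^{\lambda X_t}$ gives

$$\mathbb{P}(X_t > \varepsilon) = \mathbb{P}\left(e^{\lambda X_t} > e^{\lambda \varepsilon}\right) \leq e^{-\lambda \varepsilon}\, \mathbb{E}\left[e^{\lambda X_t}\right].$$

If, in addition, the MGF satisfies $\mathbb{E}[e^{\lambda X_t}] \leq e^{\lambda C_t}$, then choosing $\varepsilon = C_t + \frac{\log(1/\delta)}{\lambda}$ gives

$$\mathbb{P}\left(X_t > C_t + \frac{\log(1/\delta)}{\lambda}\right) \leq \delta,$$

so the failure probability $\delta$ enters the bound only logarithmically, which is the exponentially decaying tail referred to above.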

Abstract

Convergence in high probability (HP) has been receiving increasing interest, due to its attractive properties, such as exponentially decaying tail bounds and strong guarantees for each individual run of an algorithm. While HP guarantees are extensively studied in centralized settings, much less is understood in the decentralized, networked setup. Existing HP studies in decentralized settings impose strong assumptions, such as uniformly bounded gradients or asymptotically vanishing noise, resulting in a significant gap between the assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, even for the vanilla Decentralized Stochastic Gradient Descent ($\mathtt{DSGD}$) algorithm. This is contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed to guarantee MSE convergence. Motivated by this observation, we revisit HP guarantees for $\mathtt{DSGD}$ in the presence of light-tailed noise. We show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing uniformly bounded gradients and other restrictive assumptions, while simultaneously achieving order-optimal rates for both non-convex and strongly convex costs. Moreover, our improved analysis yields linear speed-up in the number of users, demonstrating that $\mathtt{DSGD}$ maintains strong performance in the HP sense and matches existing MSE guarantees. Our improved results stem from a careful analysis of the MGF of quantities of interest (norm-squared of gradient or optimality gap) and the MGF of the consensus gap between users' models. To achieve linear speed-up, we provide a novel result on the variance-reduction effect of decentralized methods in the HP sense and more fine-grained bounds on the MGF for strongly convex costs, which are both of independent interest.
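For readers unfamiliar with the algorithm, the following is a minimal sketch of one common form of $\mathtt{DSGD}$, not necessarily the exact variant analysed in the paper. Everything specific here is an assumption made for illustration: the ring topology, the mixing matrix with weights 1/3, the quadratic local costs $f_i(x) = \frac{1}{2}\|x - b_i\|^2$, the Gaussian (light-tailed) gradient noise, the step size $\alpha_t = 1/(t+2)$, and the order of the local step and the averaging step.

```python
import numpy as np


def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix for a ring (assumed topology):
    each user averages itself and its two neighbours with weight 1/3."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] += 1.0 / 3.0
        W[i, (i - 1) % n] += 1.0 / 3.0
        W[i, (i + 1) % n] += 1.0 / 3.0
    return W


def dsgd(n_users=8, dim=5, n_iters=2000, noise_std=1.0, seed=0):
    """Run a toy DSGD and return (optimality gap of the network average,
    average consensus gap between users' models)."""
    rng = np.random.default_rng(seed)
    W = ring_mixing_matrix(n_users)

    # Hypothetical local costs f_i(x) = 0.5 * ||x - b_i||^2, so the
    # minimiser of the average cost is the mean of the b_i.
    b = rng.normal(size=(n_users, dim))
    x_star = b.mean(axis=0)

    x = np.zeros((n_users, dim))  # row i holds user i's model
    for t in range(n_iters):
        alpha = 1.0 / (t + 2)  # decaying step size, at most 1/2 (here L = 1)
        # Stochastic gradients: exact gradient plus Gaussian noise.
        grads = (x - b) + noise_std * rng.normal(size=x.shape)
        # Local SGD step, then averaging with neighbours.
        x = W @ (x - alpha * grads)

    x_bar = x.mean(axis=0)
    opt_gap = float(np.sum((x_bar - x_star) ** 2))
    consensus_gap = float(np.sum((x - x_bar) ** 2) / n_users)
    return opt_gap, consensus_gap


opt_gap, consensus_gap = dsgd()
print(f"optimality gap: {opt_gap:.2e}, consensus gap: {consensus_gap:.2e}")
```

The two returned quantities correspond loosely to the two objects the abstract says are analysed through their MGFs: the optimality gap and the consensus gap between users' models. Re-running with different `seed` values gives the individual runs that an HP guarantee speaks about.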

Paper Structure

This paper contains 27 sections, 25 theorems, 131 equations, 2 figures, 1 algorithm.

Key Result

Lemma 1

Let (A3) hold. If $\alpha_t \leq \frac{1}{2L}$, we have

Figures (2)

  • Figure 1: Performance of DSGD, in the MSE sense (left) and HP sense (right). We can see that DSGD achieves an exponential tail decay for all values of the threshold $\varepsilon$. For the threshold $\varepsilon = 10^{-4}$, the tail probability starts decaying exponentially after approximately $t = 6000$ iterations, which is consistent with the MSE behaviour, where we can see that DSGD takes around the same number of iterations to reach the average accuracy $\mathbb{E}^t_n = 10^{-4}$.
  • Figure 2: Linear speed-up of DSGD, in the MSE and HP sense. Left to right and top to bottom: MSE performance and tail decay with thresholds $\varepsilon \in \{10^{-2},10^{-3},10^{-4}\}$. We can see that DSGD consistently achieves faster exponential tail decay for larger networks, across all values of the threshold $\varepsilon$, illustrating the effect of the linear speed-up in the HP sense.
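The tail-decay curves described in the captions can be approximated from repeated runs. The sketch below shows one plausible way to do so; it is an assumption about the procedure, not the authors' code. The function name `empirical_tail` and the toy `errors` array are hypothetical. At each iteration it reports the fraction of independent runs whose error exceeds the threshold $\varepsilon$.

```python
import numpy as np


def empirical_tail(errors, eps):
    """Fraction of runs with error > eps at each iteration.

    errors: array-like of shape (n_runs, n_iters), one row per independent run.
    """
    errors = np.asarray(errors, dtype=float)
    return (errors > eps).mean(axis=0)


# Toy data: 4 hypothetical runs, 3 iterations each.
errors = [
    [1.0, 0.10, 0.0100],
    [1.0, 0.50, 0.0010],
    [1.0, 0.05, 0.2000],
    [1.0, 0.20, 0.0001],
]
print(empirical_tail(errors, eps=0.1))  # tail estimates: 1.0, 0.5, 0.25
```

Plotting the output on a logarithmic vertical axis against the iteration index would show an exponential tail decay as a roughly straight, downward-sloping line.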

Theorems & Definitions (34)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 1
  • Lemma 4
  • Lemma 5
  • Theorem 2
  • Proposition 1: Jensen's inequality
  • Proposition 2: Cauchy-Schwarz inequality
  • Proposition 3: Young's inequality
  • ...and 24 more