A Bias-Correction Decentralized Stochastic Gradient Algorithm with Momentum Acceleration

Yuchen Hu, Xi Chen, Weidong Liu, Xiaojun Mao

TL;DR

The paper tackles decentralized stochastic optimization over sparse networks with data heterogeneity by introducing Exact-Diffusion with Momentum (EDM), a momentum-augmented bias-correction algorithm. EDM extends the ED/D$^2$ framework by incorporating a momentum term to accelerate convergence while neutralizing heterogeneity-induced bias, with the key update $X^{(t+2)} = W(2X^{(t+1)} - X^{(t)} - \alpha\mathbf{M}^{(t+1)} + \alpha\mathbf{M}^{(t)})$. The authors prove that, under non-convex objectives, EDM converges sub-linearly to a neighborhood whose radius is independent of data heterogeneity, and under the Polyak-Lojasiewicz condition, EDM achieves linear convergence to a neighborhood, with a convergence bound that is tighter than prior momentum-based bias-correction methods. The analysis introduces auxiliary variables, variance decomposition, and consensus-transform techniques to tightly bound consensus errors and gradient progress, and demonstrates that EDM can eliminate heterogeneity effects at a faster rate than DSGT-based momentum methods. Overall, EDM provides a practically robust method for accelerating decentralized training in sparse networks with heterogeneous data, while offering a rigorous framework that could extend to other momentum-based bias-correction algorithms.
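To make the key update concrete, here is a minimal NumPy sketch of the recursion on a toy problem. Only the update for $X^{(t+2)}$ comes from the summary above. Everything else is an assumption made for illustration: the momentum recursion $\mathbf{M}^{(t+1)} = \beta\mathbf{M}^{(t)} + (1-\beta)G^{(t+1)}$, the initialization of $\mathbf{M}^{(0)}$ and $X^{(1)}$, the lazy ring mixing matrix, the heterogeneous quadratic losses $f_i(x) = \tfrac{1}{2}\|x - b_i\|^2$, and all function and parameter names. The paper's actual algorithm may differ in these details.

```python
import numpy as np


def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic weights on a ring (hypothetical choice).

    Each agent averages itself and its two neighbors with weight 1/3, and the
    result is mixed with the identity so that all eigenvalues are positive.
    """
    w = np.zeros((n, n))
    for i in range(n):
        w[i, i] = 1.0 / 3.0
        w[i, (i - 1) % n] += 1.0 / 3.0
        w[i, (i + 1) % n] += 1.0 / 3.0
    return 0.5 * (np.eye(n) + w)


def edm(b, w, alpha=0.05, beta=0.9, noise_std=0.0, steps=2000, seed=0):
    """Run the EDM-style recursion on f_i(x) = 0.5 * ||x - b_i||^2.

    Row i of every matrix belongs to agent i. The momentum rule and the
    first step are assumptions; only the X^(t+2) update is from the summary.
    """
    rng = np.random.default_rng(seed)

    def grad(x):
        # Local gradients x_i - b_i, plus optional Gaussian gradient noise.
        return x - b + noise_std * rng.standard_normal(x.shape)

    x_prev = np.zeros_like(b)                  # X^(0)
    m_prev = grad(x_prev)                      # M^(0), assumed initialization
    x_curr = w @ (x_prev - alpha * m_prev)     # X^(1), assumed first step
    for _ in range(steps):
        # Assumed momentum recursion producing M^(t+1).
        m_curr = beta * m_prev + (1.0 - beta) * grad(x_curr)
        # X^(t+2) = W(2X^(t+1) - X^(t) - alpha*M^(t+1) + alpha*M^(t))
        x_next = w @ (2.0 * x_curr - x_prev - alpha * m_curr + alpha * m_prev)
        x_prev, x_curr, m_prev = x_curr, x_next, m_curr
    return x_curr


n_agents, dim = 8, 3
b = np.random.default_rng(1).normal(scale=5.0, size=(n_agents, dim))
w = ring_mixing_matrix(n_agents)
x = edm(b, w)

# The minimizer of the average loss is the mean of the b_i.
err = np.max(np.abs(x - b.mean(axis=0)))
print("max deviation from the global minimizer:", err)
```

With `noise_std=0.0`, every agent's iterate in this toy setting approaches the minimizer of the average loss even though the local minimizers `b_i` differ widely. Setting `noise_std` above zero gives a rough feel for the noise-dependent neighborhood the summary refers to.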

Abstract

Distributed stochastic optimization algorithms can simultaneously process large-scale datasets, significantly accelerating model training. However, their effectiveness is often hindered by the sparsity of distributed networks and data heterogeneity. In this paper, we propose a momentum-accelerated distributed stochastic gradient algorithm, termed Exact-Diffusion with Momentum (EDM), which mitigates the bias from data heterogeneity and incorporates momentum techniques commonly used in deep learning to enhance convergence rate. Our theoretical analysis demonstrates that the EDM algorithm converges sub-linearly to the neighborhood of the optimal solution, the radius of which is irrespective of data heterogeneity, when applied to non-convex objective functions; under the Polyak-Lojasiewicz condition, which is a weaker assumption than strong convexity, it converges linearly to the target region. Our analysis techniques employed to handle momentum in complex distributed parameter update structures yield a sufficiently tight convergence upper bound, offering a new perspective for the theoretical analysis of other momentum-based distributed algorithms.
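For reference, the Polyak-Lojasiewicz condition mentioned in the abstract is commonly stated as follows. The symbols $\mu$ (a positive constant) and $f^\star$ (the minimum value of $f$) are standard notation assumed here, not taken from the paper:

$$\frac{1}{2}\,\|\nabla f(\mathbf{x})\|^2 \;\geq\; \mu\,\bigl(f(\mathbf{x}) - f^\star\bigr) \quad \text{for all } \mathbf{x}.$$

Every $\mu$-strongly convex function satisfies this inequality with the same $\mu$, but the condition does not require convexity, which is why it is the weaker assumption.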

Paper Structure

This paper contains 22 sections, 8 theorems, 77 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

Suppose that the assumptions, from the one on the mixing matrix $W$ through the one on the variance, hold. Then the following inequality holds for $t\geq 0$,

Figures (4)

  • Figure 1: Quadratic loss function. We choose $\sigma^2 = 0.05$, $\alpha = 0.05$, $\lambda = 0.99$ and use different $\zeta^2$ to control the heterogeneity of $f_i$.
  • Figure 2: Logistic regression with $l_2$-regularization, random noise version. We choose $\sigma_s^2 = 0.01$, $\alpha = 0.5$, $\lambda = 0.99$, and use $\sigma_h^2$ to control the variance of $\mathbf{x}_i^\star$, which reflects the heterogeneity.
  • Figure 3: Classification of CIFAR10 by VGG11. Learning rate $\alpha = 0.1$, heterogeneity parameter $\phi = 1.0$ (heterogeneous).
  • Figure 4: Classification of CIFAR10 by VGG11. Learning rate $\alpha = 0.1$, heterogeneity parameter $\phi = 0.1$ (very heterogeneous).

Theorems & Definitions (15)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Remark 5
  • Theorem 5
  • ...and 5 more