Achieving Linear Speedup with ProxSkip in Distributed Stochastic Optimization

Luyao Guo, Sulaiman A. Alghunaim, Kun Yuan, Laurent Condat, Jinde Cao

TL;DR

This work provides a unified non-asymptotic analysis of ProxSkip for distributed stochastic optimization across non-convex, convex, and strongly convex settings. It demonstrates that ProxSkip achieves linear speedup with respect to the number of nodes n and, in the strongly convex case, can do so with network-independent stepsizes. The results reveal how gradient noise, local updates, network connectivity, and data heterogeneity influence convergence, and show that increasing local updates reduces communication complexity without sacrificing accuracy. Comprehensive experiments on synthetic data and the ijcnn1 dataset corroborate the theoretical findings, highlighting ProxSkip’s robustness to heterogeneity and its competitive performance against other local-update methods.

Abstract

The ProxSkip algorithm for distributed optimization is gaining increasing attention due to its effectiveness in reducing communication. However, existing analyses of ProxSkip are limited to the strongly convex setting and fail to achieve linear speedup with respect to the number of nodes. Key questions regarding its behavior in the non-convex setting and the achievability of linear speedup remain open. In this paper, we revisit ProxSkip and address both questions. We provide a comprehensive analysis for stochastic non-convex, convex, and strongly convex problems, revealing the effects of gradient noise, local updates, network connectivity, and data heterogeneity on its convergence. We prove that ProxSkip achieves linear speedup across all three settings, and can further achieve linear speedup with network-independent stepsizes in the strongly convex setting. Moreover, we show that properly increasing local updates effectively reduces communication complexity.
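To make the "local updates with occasional communication" idea concrete, here is a minimal sketch of the original server-averaging form of ProxSkip (often called Scaffnew), where the proximal step is an average across nodes and is performed only with probability `p`. It is an illustration under stated assumptions, not the decentralized variant with a mixing matrix that this paper analyzes. The quadratic local objectives $f_i(x)=\frac{a_i}{2}\|x-c_i\|^2$, the function name `proxskip_consensus`, and all parameter values are hypothetical choices made for the example.

```python
import numpy as np


def proxskip_consensus(a, c, gamma, p, iters, sigma=0.0, seed=0):
    """Sketch of server-averaging ProxSkip on f_i(x) = 0.5 * a_i * ||x - c_i||^2.

    a: (n,) positive curvatures, c: (n, d) local minimizers (both hypothetical).
    gamma: stepsize, p: communication probability, sigma: gradient-noise level.
    Returns the final local iterates (n, d) and the number of communications.
    """
    rng = np.random.default_rng(seed)
    n, d = c.shape
    x = np.zeros((n, d))  # local models
    h = np.zeros((n, d))  # control variates; their sum stays zero
    comms = 0
    for _ in range(iters):
        # Stochastic local gradient: exact gradient plus Gaussian noise.
        grad = a[:, None] * (x - c) + sigma * rng.standard_normal((n, d))
        # Local step, corrected by the control variate.
        x_hat = x - gamma * (grad - h)
        if rng.random() < p:
            # Communication round: the prox of the consensus constraint
            # is the average over nodes.
            avg = (x_hat - (gamma / p) * h).mean(axis=0)
            x_new = np.tile(avg, (n, 1))
            comms += 1
        else:
            # Skipped communication: keep the local iterate.
            x_new = x_hat
        # Control variates change only when a communication happened.
        h = h + (p / gamma) * (x_new - x_hat)
        x = x_new
    return x, comms


if __name__ == "__main__":
    a = np.array([1.0, 2.0, 3.0, 4.0])
    c = np.random.default_rng(1).standard_normal((4, 3))
    x, comms = proxskip_consensus(a, c, gamma=0.25, p=0.5, iters=2000)
    x_star = (a[:, None] * c).sum(axis=0) / a.sum()  # minimizer of the average
    print("max error:", np.abs(x - x_star).max(), "communications:", comms)
```

With `sigma=0.0` the iterates should approach the minimizer of the averaged objective even though the local minimizers `c[i]` differ, which is the robustness to data heterogeneity the abstract refers to. Setting `sigma > 0` gives a noisy variant that can be used to explore how the error floor changes with the number of nodes.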

Paper Structure

This paper contains 33 sections, 17 theorems, 241 equations, 6 figures, and 2 tables.

Key Result

Lemma 1

Suppose that Assumptions MixingMatrix, ASS1, and StochasticGradient1 hold, and each $f_i$ is $\mu$-strongly convex for some $0<\mu\leq L$. If $0<\alpha\leq1/L$, $\beta=p$, and $\chi\geq1$, it holds that [displayed bound not recovered in this extract], where $a_0$ is a constant that depends on the initialization and $\zeta=\max\left\{1-\alpha\mu,1-\frac{(1-\lambda_2)p^2}{2\chi} \right\}<1$.
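The contraction factor $\zeta$ in Lemma 1 can be evaluated directly from its definition. The helper below is a small sketch: the function name `contraction_factor` and the numeric values in the usage example are hypothetical, and $\lambda_2$ is assumed here to be the quantity for which $1-\lambda_2$ measures network connectivity, as in the lemma.

```python
def contraction_factor(alpha, mu, L, lambda2, p, chi):
    """Evaluate zeta = max{1 - alpha*mu, 1 - (1 - lambda2) * p^2 / (2*chi)}.

    The checks mirror the conditions stated in Lemma 1.
    """
    if not (0 < mu <= L):
        raise ValueError("need 0 < mu <= L")
    if not (0 < alpha <= 1 / L):
        raise ValueError("need 0 < alpha <= 1/L")
    if chi < 1:
        raise ValueError("need chi >= 1")
    optimization_term = 1 - alpha * mu                   # driven by stepsize and convexity
    network_term = 1 - (1 - lambda2) * p**2 / (2 * chi)  # driven by connectivity and p
    return max(optimization_term, network_term)


if __name__ == "__main__":
    # Hypothetical values chosen only to show which term dominates.
    zeta = contraction_factor(alpha=0.01, mu=1.0, L=100.0, lambda2=0.75, p=0.2, chi=1.0)
    print(zeta)
```

Whichever term is larger determines the rate: a small communication probability $p$ or a poorly connected network pushes the second term toward 1, while a small stepsize or weak strong convexity does the same to the first.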

Figures (6)

  • Figure 1: Learning a synthetic convex function over $10$ nodes with noise $\sigma^2=1$ (baselines: Local-DSGD [Wang2021, Stich2020], $K$-GT [Liu2023], and LED [Alghunaim2023]). All methods use the same stepsize, and results are averaged over ten repetitions. The communication probability for ProxSkip is $p$, and the number of local updates for Local-DSGD, $K$-GT, and LED is $1/p$.
  • Figure 2: Experimental results for ProxSkip on a logistic regression problem with a strongly convex regularizer $r({\bf{x}})=\frac{1}{2}\|{\bf{x}}\|^2$ on the ijcnn1 dataset.
  • Figure 3: Experimental results for a logistic regression problem on the ijcnn1 dataset with regularizer $r({\bf{x}})=\frac{L}{200}\|{\bf{x}}\|^2$, where $1-\lambda_2\approx0.25$. We set $p=1/\sqrt{(1-\lambda_2)\kappa}\approx0.2$ (the optimal choice predicted by the theory).
  • Figure 4: Linear speedup of ProxSkip in non-convex settings on the ijcnn1 dataset.
  • Figure 5: Experimental comparison with the same stepsize in non-convex settings on the ijcnn1 dataset.
  • ...and 1 more figure
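The communication probability quoted in the Figure 3 caption, $p=1/\sqrt{(1-\lambda_2)\kappa}$, can be checked with a few lines of arithmetic. This sketch assumes a condition number $\kappa=100$, which is an inference rather than a value stated in the caption: it follows if the regularizer $\frac{L}{200}\|{\bf{x}}\|^2$ supplies all of the strong convexity ($\mu=L/100$) and $\kappa=L/\mu$. The function name is hypothetical.

```python
import math


def comm_probability(spectral_gap, kappa):
    """p = 1 / sqrt((1 - lambda_2) * kappa), capped at 1 since p is a probability."""
    if spectral_gap <= 0 or kappa < 1:
        raise ValueError("need spectral_gap > 0 and kappa >= 1")
    return min(1.0, 1.0 / math.sqrt(spectral_gap * kappa))


if __name__ == "__main__":
    # 1 - lambda_2 = 0.25 is from the caption; kappa = 100 is an assumption.
    p = comm_probability(spectral_gap=0.25, kappa=100.0)
    print("p =", p, "average local steps per communication =", 1.0 / p)
```

Under these assumptions the formula reproduces the caption's $p\approx0.2$, which corresponds to roughly five local updates per communication round on average.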

Theorems & Definitions (39)

  • Lemma 1
  • Theorem 1
  • Corollary 1
  • Corollary 2
  • Theorem 2
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Theorem 3
  • ...and 29 more