SLowcal-SGD: Slow Query Points Improve Local-SGD for Stochastic Convex Optimization

Tehila Dahan, Kfir Y. Levy

TL;DR

This work tackles distributed stochastic convex optimization with heterogeneous data across machines. It introduces SLowcal-SGD, a Local-SGD–style method that uses Anytime-GD with slowly changing query points and increasing weights to reduce bias from local updates, yielding provable improvements over Minibatch-SGD and Local-SGD in the heterogeneous setting. The theoretical guarantee shows an excess loss bound that scales favorably with rounds and local steps, and experiments on MNIST with non-IID Dirichlet partitions demonstrate practical gains, especially with more workers and larger local steps. Overall, the approach advances how local updates are coordinated in heterogeneous distributed systems, potentially reducing communication overhead while maintaining convergence speed.

Abstract

We consider distributed learning scenarios where M machines interact with a parameter server along several communication rounds in order to minimize a joint objective function. Focusing on the heterogeneous case, where different machines may draw samples from different data-distributions, we design the first local update method that provably benefits over the two most prominent distributed baselines: namely Minibatch-SGD and Local-SGD. Key to our approach is a slow querying technique that we customize to the distributed setting, which in turn enables a better mitigation of the bias caused by local updates.
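The slow querying idea can be illustrated on a single machine with a minimal sketch of an Anytime-style update, in which gradients are queried at a weighted average $x_t$ of the iterates $w_t$ rather than at $w_t$ itself. This is not the authors' distributed algorithm: the function name `anytime_sgd`, the linear weights $\alpha_t = t$, the constant learning rate, and the Gaussian noise model are assumptions chosen for illustration.

```python
import numpy as np

def anytime_sgd(grad_fn, w1, num_steps, lr, noise_std=0.0, seed=0):
    """Single-machine Anytime-style SGD sketch (illustrative assumptions only).

    Gradients are evaluated at the query point x_t, the alpha-weighted
    average of w_1..w_t, with assumed weights alpha_t = t.
    Returns the final query point, all iterates w_t, and all weights.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w1, dtype=float)
    x = w.copy()                 # x_1 = w_1
    alpha_sum = 1.0              # alpha_{1:1}, since alpha_1 = 1
    iterates, weights = [w.copy()], [1.0]
    for t in range(1, num_steps + 1):
        alpha_t = float(t)
        # Noisy gradient queried at the slowly moving point x_t.
        g = grad_fn(x) + noise_std * rng.standard_normal(x.shape)
        w = w - lr * alpha_t * g                 # w_{t+1}
        alpha_next = float(t + 1)
        alpha_sum += alpha_next                  # alpha_{1:t+1}
        # Incremental weighted average: x moves only a
        # alpha_{t+1} / alpha_{1:t+1} fraction of the way toward w_{t+1}.
        x = x + (alpha_next / alpha_sum) * (w - x)
        iterates.append(w.copy())
        weights.append(alpha_next)
    return x, np.array(iterates), np.array(weights)
```

With $\alpha_t = t$ the averaging fraction is $2/(t+2)$, so consecutive query points drift apart more slowly as $t$ grows, which is the property the slow-query-point approach relies on.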

Paper Structure

This paper contains 52 sections, 17 theorems, 153 equations, 8 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Let $f:{\mathbb R}^d\mapsto {\mathbb R}$ be a convex function with a global minimum $w^*$. Also let $\{\alpha_t\geq 0\}_t$, and $\{w_t\in{\mathbb R}^d\}_{t},\{x_t\in{\mathbb R}^d\}_{t}$ such that $\{x_t\}_{t}$ is an $\{\alpha_t\}_t$ weighted average of $\{w_t\}_{t}$. Then the following holds for any

Figures (8)

  • Figure 1: Performance vs. Local Iterations ($K$) for different numbers of workers ($M$).
  • Figure 2: Class distribution across workers for different numbers of workers (16, 32, and 64) on the MNIST dataset. The dataset was partitioned using a Dirichlet distribution, with the dirichlet-alpha parameter set to 0.1 to induce high heterogeneity. Each scatter plot illustrates class frequencies for each worker.
  • Figure 3: Test Accuracy vs. Local Iterations ($K$) for 16 workers ($\uparrow$ is better).
  • Figure 4: Test Accuracy vs. Local Iterations ($K$) for 32 workers ($\uparrow$ is better).
  • Figure 5: Test Accuracy vs. Local Iterations ($K$) for 64 workers ($\uparrow$ is better).
  • ...and 3 more figures
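
The Dirichlet partitioning described in the Figure 2 caption can be sketched as follows. This is a common label-skew scheme and not necessarily the exact procedure used in the paper: the function name `dirichlet_partition`, the per-class sampling of worker proportions, and the seeding are assumptions made for illustration.

```python
import numpy as np

def dirichlet_partition(labels, num_workers, alpha=0.1, seed=0):
    """Split sample indices across workers with Dirichlet label skew.

    For each class, worker proportions are drawn from
    Dirichlet(alpha, ..., alpha); small alpha concentrates each class
    on a few workers, giving highly heterogeneous local datasets.
    Returns a list of index arrays, one per worker.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    worker_indices = [[] for _ in range(num_workers)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(num_workers))
        # Turn cumulative proportions into split points within this class.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for worker, shard in enumerate(np.split(idx, cuts)):
            worker_indices[worker].extend(shard.tolist())
    return [np.array(ix, dtype=int) for ix in worker_indices]
```

Calling `dirichlet_partition(train_labels, num_workers=16, alpha=0.1)` on a hypothetical array `train_labels` of MNIST labels would yield 16 disjoint index sets; some workers may receive very few samples when `alpha` is this small.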

Theorems & Definitions (35)

  • Theorem 1: Rephrased from Theorem 1 in cutkosky2019anytime
  • Theorem 2
  • proof : Proof Sketch for Theorem \ref{thm:Main}
  • Lemma 1
  • proof : Proof of Lemma \ref{lem:SVRG}
  • proof : Proof of Theorem \ref{theo:Anytime}
  • proof : Proof of Thm. \ref{thm:Main}
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • ...and 25 more