Scaffold with Stochastic Gradients: New Analysis with Linear Speed-Up

Paul Mangold, Alain Durmus, Aymeric Dieuleveut, Eric Moulines

TL;DR

This work recasts Scaffold in a Markov-chain framework, showing that the joint evolution of global parameters and client control variates forms a contractive Markov chain that converges to a unique stationary distribution in Wasserstein distance. It proves a non-asymptotic rate with linear speed-up in the number of clients, showing that the global-iterate variance scales as $O(\frac{1}{N})$ up to higher-order terms while maintaining favorable convergence in stochastic settings. The analysis also reveals a persistent higher-order bias in Scaffold’s stationary regime, despite heterogeneity mitigation, highlighting limits of drift correction and guiding design principles for improved stochastic federated methods. Collectively, the results establish Scaffold as scalable in client count under stochastic gradients and provide a rigorous foundation for analyzing covariance structure and bias in federated optimization.

Abstract

This paper proposes a novel analysis for the Scaffold algorithm, a popular method for dealing with data heterogeneity in federated learning. While its convergence in deterministic settings--where local control variates mitigate client drift--is well established, the impact of stochastic gradient updates on its performance is less understood. To address this problem, we first show that its global parameters and control variates define a Markov chain that converges to a stationary distribution in the Wasserstein distance. Leveraging this result, we prove that Scaffold achieves linear speed-up in the number of clients up to higher-order terms in the step size. Nevertheless, our analysis reveals that Scaffold retains a higher-order bias, similar to FedAvg, that does not decrease as the number of clients increases. This highlights opportunities for developing improved stochastic federated learning algorithms.
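
As a rough illustration of the algorithm discussed above, the sketch below runs one common formulation of Scaffold (full client participation, control variates that sum to zero across clients) with noisy gradients on a toy quadratic problem. The objective $f_i(\theta) = \frac{1}{2} a_i \|\theta - b_i\|^2$, the Gaussian gradient noise, and the names `scaffold`, `a`, `b`, `sigma`, and `xi` are assumptions made for this example; it is not the paper's code or experimental setup.

```python
import numpy as np


def scaffold(a, b, gamma, H, rounds, sigma, seed=0):
    """Toy Scaffold on f_i(theta) = 0.5 * a[i] * ||theta - b[i]||^2.

    Gradients are perturbed by Gaussian noise of standard deviation `sigma`
    (an assumption of this sketch). Returns the final global iterate and the
    per-client control variates.
    """
    rng = np.random.default_rng(seed)
    N, d = b.shape
    theta = np.zeros(d)    # global parameters
    xi = np.zeros((N, d))  # control variates; their sum over clients stays zero
    for _ in range(rounds):
        local = np.tile(theta, (N, 1))  # each client starts from the global iterate
        for _ in range(H):
            noise = sigma * rng.standard_normal((N, d))
            grads = a[:, None] * (local - b) + noise  # stochastic local gradients
            local -= gamma * (grads + xi)             # drift-corrected local step
        theta = local.mean(axis=0)                    # server averages client iterates
        xi += (local - theta) / (gamma * H)           # control-variate update
    return theta, xi


# Hypothetical heterogeneous problem: 8 clients, 3 parameters.
rng = np.random.default_rng(1)
a = rng.uniform(0.5, 2.0, size=8)
b = rng.standard_normal((8, 3))
theta_star = (a[:, None] * b).sum(axis=0) / a.sum()  # minimizer of the average objective

for sigma in (0.0, 0.5):
    theta, _ = scaffold(a, b, gamma=0.05, H=5, rounds=300, sigma=sigma)
    print(f"sigma={sigma}: squared error = {np.sum((theta - theta_star) ** 2):.3e}")
```

With `sigma=0.0` the iterates should settle at `theta_star` despite the heterogeneous `a` and `b`; with `sigma > 0` the iterate keeps fluctuating around it, which is the stationary regime the analysis characterizes.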

Paper Structure

This paper contains 44 sections, 47 theorems, 261 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Lemma 4.0

Assume assum:strong-convexity and assum:smoothness. Let $Z = Z_{(1:N)}^{1:H}$ be i.i.d. random variables satisfying assum:smooth-var. Let the step size $\gamma>0$ and number of local updates $H >0$ satisfy $\gamma \leq 1/(2L)$ and $\gamma H (L + \mu) \le 1$. Then, for any $\theta, \theta' \in \mathb with $\mathrm{X} \!=\! ( \theta, \xi_{(1)}^{}, \dots, \xi_{(N)}^{} )$, and $\mathrm{X}' \!=\! ( \th
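
The two step-size conditions stated in the lemma, $\gamma \leq 1/(2L)$ and $\gamma H (L + \mu) \le 1$, can be checked numerically. The helper names `max_step_size` and `satisfies_conditions` and the constants used below are hypothetical; only the two inequalities come from the lemma statement.

```python
def max_step_size(L, mu, H):
    """Largest gamma meeting both gamma <= 1/(2L) and gamma * H * (L + mu) <= 1."""
    return min(1.0 / (2.0 * L), 1.0 / (H * (L + mu)))


def satisfies_conditions(gamma, H, L, mu):
    """True if (gamma, H) meets both step-size conditions for constants L and mu."""
    return 0.0 < gamma <= 1.0 / (2.0 * L) and gamma * H * (L + mu) <= 1.0


# Hypothetical constants, for illustration only.
L, mu = 2.0, 0.5
print(max_step_size(L, mu, H=5))
print(satisfies_conditions(0.05, 5, L, mu), satisfies_conditions(0.05, 10, L, mu))
```

For the values $H = 100$ and $\gamma = 0.05$ quoted in the Figure 1 caption, $\gamma H = 5$, so the second condition would require $L + \mu \le 0.2$; the constants of the experimental problems are not given in this excerpt, so no conclusion is drawn about whether they satisfy it.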

Figures (1)

  • Figure 1: Mean squared error $\mathbb{E}[ \| \theta^{t} - \theta^{\star} \|^2 ]$ as a function of the number of communications, with $H = 100$ and $\gamma = 0.05$, for linear regression (top row) and logistic regression (bottom row) problems. For each curve, we plot the average over $3$ runs and the standard deviation.

Theorems & Definitions (84)

  • Remark 2.1
  • Lemma 4.0
  • Theorem 4.1
  • Lemma 4.1
  • Theorem 4.2
  • Corollary 4.3
  • Lemma 4.3
  • Lemma 4.4
  • Theorem 4.5
  • Theorem 4.6
  • ...and 74 more