Scaffold with Stochastic Gradients: New Analysis with Linear Speed-Up
Paul Mangold, Alain Durmus, Aymeric Dieuleveut, Eric Moulines
TL;DR
This work recasts Scaffold in a Markov-chain framework: the joint evolution of the global parameters and the client control variates forms a contractive Markov chain that converges to a unique stationary distribution in Wasserstein distance. It proves a non-asymptotic rate with linear speed-up in the number of clients $N$, with the variance of the global iterate scaling as $O(\frac{1}{N})$ up to higher-order terms in the step size. The analysis also reveals that, although the control variates mitigate heterogeneity, Scaffold retains a higher-order bias in its stationary regime that does not shrink as $N$ grows. This marks a limit of drift correction and suggests design principles for improved stochastic federated methods. Together, the results establish that Scaffold scales with the number of clients under stochastic gradients and provide a rigorous basis for analyzing covariance structure and bias in federated optimization.
Abstract
This paper proposes a novel analysis of the Scaffold algorithm, a popular method for dealing with data heterogeneity in federated learning. While its convergence in deterministic settings, where local control variates mitigate client drift, is well established, the impact of stochastic gradient updates on its performance is less understood. To address this problem, we first show that its global parameters and control variates define a Markov chain that converges to a stationary distribution in the Wasserstein distance. Leveraging this result, we prove that Scaffold achieves linear speed-up in the number of clients up to higher-order terms in the step size. Nevertheless, our analysis reveals that Scaffold retains a higher-order bias, similar to FedAvg, that does not decrease as the number of clients increases. This highlights opportunities for developing improved stochastic federated learning algorithms.
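To make the objects in the abstract concrete, the following is a minimal sketch of one common formulation of Scaffold with full client participation, run on scalar quadratic client objectives with additive Gaussian gradient noise. It is an illustration under stated assumptions, not necessarily the exact variant analyzed in the paper: the function name `run_scaffold`, the objectives $f_i(x) = \frac{1}{2} a_i (x - b_i)^2$, the noise model, and all default parameter values are hypothetical choices made for this example. The pair (global parameter `x`, control variates `c_i`) carried from round to round is the state of the Markov chain the abstract refers to.

```python
import random


def run_scaffold(a, b, gamma=0.05, local_steps=5, rounds=200, sigma=0.0, seed=0):
    """Illustrative Scaffold on scalar quadratics f_i(x) = 0.5 * a_i * (x - b_i)^2.

    Assumptions of this sketch (not taken from the paper): full participation,
    global step size 1, control variates initialized at zero, and Gaussian
    gradient noise with standard deviation `sigma`.
    """
    rng = random.Random(seed)
    n = len(a)
    x = 0.0               # global parameter
    c_i = [0.0] * n       # per-client control variates
    c = 0.0               # server control variate (average of the c_i)

    for _round in range(rounds):
        local_finals = []
        for i in range(n):
            y = x
            for _step in range(local_steps):
                # Stochastic gradient of f_i at y.
                grad = a[i] * (y - b[i]) + sigma * rng.gauss(0.0, 1.0)
                # Drift-corrected local step.
                y -= gamma * (grad - c_i[i] + c)
            # Control-variate update from the local displacement
            # (uses the server variate c from the start of the round).
            c_i[i] = c_i[i] - c + (x - y) / (local_steps * gamma)
            local_finals.append(y)
        # Server aggregation: average local iterates and control variates.
        x = sum(local_finals) / n
        c = sum(c_i) / n
    return x


if __name__ == "__main__":
    a = [0.5, 1.0, 1.5, 2.0]    # heterogeneous curvatures
    b = [-1.0, 0.0, 2.0, 3.0]   # heterogeneous local minimizers
    x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)
    print("minimizer of the average objective:", x_star)
    print("noiseless run:", run_scaffold(a, b, sigma=0.0))
    print("noisy run:    ", run_scaffold(a, b, sigma=1.0, seed=1))
```

With `sigma=0.0` the iterate settles at the minimizer of the average objective despite the heterogeneous clients. With `sigma > 0` it keeps fluctuating around that point; repeating the noisy run over many seeds and several client counts is one way to probe empirically how the spread of the final iterate depends on the number of clients.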
