Revisiting LocalSGD and SCAFFOLD: Improved Rates and Missing Analysis
Ruichen Luo, Sebastian U Stich, Samuel Horváth, Martin Takáč
TL;DR
This work studies distributed non-convex optimization with heterogeneous local objectives by revisiting the convergence of LocalSGD and SCAFFOLD under classic assumptions (gradient similarity, Hessian similarity, weak convexity) and a novel Lipschitz-Hessian variant. The new analyses show that LocalSGD can outperform minibatch SGD (MbSGD) for weakly convex functions without requiring uniform gradient similarity and can benefit from higher-order conditions, and that SCAFFOLD converges faster than MbSGD beyond quadratic functions under standard Hessian similarity. The key methodological contributions are a variance trick and a noiseless-sequence construction that tighten gradient discrepancy bounds, enabling speedups under weaker assumptions. The authors also introduce a weaker assumption involving the Lipschitz continuity of a convex hull of the local functions, and validate the theory with synthetic experiments that demonstrate the predicted speedups. Overall, the paper clarifies the conditions under which LocalSGD and SCAFFOLD outperform MbSGD in distributed non-convex settings, which supports fair comparisons and practical algorithm design.
Abstract
LocalSGD and SCAFFOLD are widely used methods in distributed stochastic optimization, with numerous applications in machine learning, large-scale data processing, and federated learning. However, rigorously establishing their theoretical advantages over simpler methods, such as minibatch SGD (MbSGD), has proven challenging, as existing analyses often rely on strong assumptions, unrealistic premises, or overly restrictive scenarios. In this work, we revisit the convergence properties of LocalSGD and SCAFFOLD under a variety of existing or weaker conditions, including gradient similarity, Hessian similarity, weak convexity, and Lipschitz continuity of the Hessian. Our analysis shows that (i) LocalSGD converges faster than MbSGD for weakly convex functions without requiring stronger gradient similarity assumptions; (ii) LocalSGD benefits significantly from higher-order similarity and smoothness; and (iii) SCAFFOLD converges faster than MbSGD for a broader class of non-quadratic functions. These theoretical insights provide a clearer understanding of the conditions under which LocalSGD and SCAFFOLD outperform MbSGD.
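To make the comparison concrete, the following is a minimal sketch of the two update schemes being compared. It is not the authors' code or experimental setup. It assumes a toy one-dimensional problem in which worker i holds the quadratic f_i(x) = 0.5 * a_i * (x - b_i)^2 with optional Gaussian gradient noise, and all function and parameter names (`noisy_grad`, `local_sgd`, `minibatch_sgd`, `local_steps`, `sigma`) are hypothetical. Both methods are given the same budget of stochastic gradients per communication round: LocalSGD takes several steps on each worker and then averages the iterates, while MbSGD averages all gradients at the shared point and takes a single step.

```python
import random


def noisy_grad(a, b, x, sigma, rng):
    """Stochastic gradient of the toy objective 0.5 * a * (x - b)**2 (assumed model)."""
    return a * (x - b) + sigma * rng.gauss(0.0, 1.0)


def local_sgd(a_list, b_list, x0, lr, rounds, local_steps, sigma=0.0, seed=0):
    """LocalSGD: each worker runs `local_steps` SGD steps, then iterates are averaged."""
    rng = random.Random(seed)
    x = x0
    for _ in range(rounds):
        local_iterates = []
        for a, b in zip(a_list, b_list):
            y = x  # every worker starts the round from the shared iterate
            for _ in range(local_steps):
                y -= lr * noisy_grad(a, b, y, sigma, rng)
            local_iterates.append(y)
        x = sum(local_iterates) / len(local_iterates)  # communication step
    return x


def minibatch_sgd(a_list, b_list, x0, lr, rounds, local_steps, sigma=0.0, seed=0):
    """MbSGD: same gradient budget per round, all evaluated at the shared iterate."""
    rng = random.Random(seed)
    x = x0
    for _ in range(rounds):
        total, count = 0.0, 0
        for a, b in zip(a_list, b_list):
            for _ in range(local_steps):
                total += noisy_grad(a, b, x, sigma, rng)
                count += 1
        x -= lr * (total / count)  # one step with the large averaged batch
    return x


if __name__ == "__main__":
    # Identical workers, no noise: LocalSGD makes local_steps times more progress per round.
    a_list, b_list = [1.0, 1.0], [0.0, 0.0]
    print(local_sgd(a_list, b_list, x0=1.0, lr=0.5, rounds=3, local_steps=4))
    print(minibatch_sgd(a_list, b_list, x0=1.0, lr=0.5, rounds=3, local_steps=4))
```

In the noiseless case with identical workers, LocalSGD contracts the error once per local step while MbSGD contracts it once per round, which is the intuition behind the speedups discussed here. The sketch uses the same step size for both methods for simplicity; a fair comparison in practice would tune the step size of each method separately, and heterogeneous workers introduce a drift that this toy example does not analyze.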
