SCAFFOLD: Stochastic Controlled Averaging for Federated Learning

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

TL;DR

Federated learning with FedAvg struggles on heterogeneous data due to client-drift, slowing convergence and increasing communication. The paper introduces SCAFFOLD, which uses a server control variate $\bm{c}$ and per-client variates $\bm{c}_i$ to correct local update drift, achieving convergence rates on par with or faster than SGD and demonstrating robustness to client sampling. In quadratics, the method can exploit Hessian similarity $\delta$ to further reduce rounds, with an optimal local-step count near $K \approx \beta/\delta$; substantial theoretical guarantees cover strongly convex, convex, and non-convex cases. Empirical results on simulated data and EMNIST show SCAFFOLD consistently outperforms SGD, FedAvg, and FedProx, especially as data heterogeneity grows or similarity increases, highlighting the practical impact for communication-efficient federated learning.

Abstract

Federated Averaging (FedAvg) has emerged as the algorithm of choice for federated learning due to its simplicity and low communication cost. However, in spite of recent research efforts, its performance is not fully understood. We obtain tight convergence rates for FedAvg and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow convergence. As a solution, we propose a new algorithm (SCAFFOLD) which uses control variates (variance reduction) to correct for the 'client-drift' in its local updates. We prove that SCAFFOLD requires significantly fewer communication rounds and is not affected by data heterogeneity or client sampling. Further, we show that (for quadratics) SCAFFOLD can take advantage of similarity in the clients' data, yielding even faster convergence. The latter is the first result to quantify the usefulness of local steps in distributed optimization.
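
To make the control-variate correction concrete, the corrected local step can be written roughly as follows. This is a sketch under assumptions: $g_i$ (the stochastic gradient on client $i$) and $\bm{y}_i$ (the client's local iterate, started from the server model $\bm{x}$) are symbols this summary does not define, and $\eta_l$ is taken to be the local step size.

$$
\bm{y}_i \leftarrow \bm{y}_i - \eta_l \bigl( g_i(\bm{y}_i) - \bm{c}_i + \bm{c} \bigr)
$$

If $\bm{c}_i \approx \nabla f_i(\bm{x})$ and $\bm{c} \approx \nabla f(\bm{x})$, then for nearby iterates the bracketed direction is approximately $\nabla f(\bm{y}_i)$, so each client follows an estimate of the global gradient rather than its own, which is what removes the drift.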

Paper Structure

This paper contains 58 sections, 31 theorems, 164 equations, 3 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

For $\beta$-smooth functions $\{f_i\}$ that satisfy the paper's heterogeneity assumption (bounded gradient dissimilarity), the output of FedAvg has expected error smaller than $\epsilon$ for some values of $\eta_l, \eta_g, R$ satisfying conditions given in the full paper, where $D := \lVert \bm{x}^0 - \bm{x}^\star \rVert^2$ and $F := f(\bm{x}^0) - f^\star$.

Figures (3)

  • Figure 1: Client-drift in FedAvg is illustrated for 2 clients with 3 local steps ($N=2$, $K=3$). The local updates $\bm{y}_i$ (in blue) move towards the individual client optima $\bm{x}_i^\star$ (orange square). The server updates (in red) move towards $\frac{1}{N}\sum_{i}\bm{x}_i^\star$ instead of to the true optimum $\bm{x}^\star$ (black square).
  • Figure 2: Update steps of SCAFFOLD on a single client. The local gradient (dashed black) points to $\bm{x}_1^\star$ (orange square), but the correction term $(\bm{c} - \bm{c}_i)$ (in red) ensures the update moves towards the true optimum $\bm{x}^\star$ (black square).
  • Figure 3: SGD (dashed black), FedAvg (above), and SCAFFOLD (below) on simulated data. FedAvg gets worse as the number of local steps increases, with $K=10$ (red) worse than $K=2$ (orange). It also gets slower as gradient-dissimilarity ($G$) increases (to the right). SCAFFOLD improves significantly with more local steps, with $K=10$ (blue) faster than both $K=2$ (light blue) and SGD. Its performance is unchanged as we vary heterogeneity ($G$).
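
The behaviour described in Figures 1 and 2 can be reproduced on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes two one-dimensional quadratic clients $f_i(x) = \tfrac{1}{2} a_i (x - b_i)^2$, exact gradients, full client participation, and a global step size of 1. The names `grad`, `fedavg`, `scaffold`, and `clients` are hypothetical, and the rule used to refresh each $c_i$ (the average corrected gradient over the local steps) is an assumption, since this summary does not spell it out.

```python
def grad(a, b, x):
    """Gradient of the toy client loss f(x) = 0.5 * a * (x - b)**2."""
    return a * (x - b)


def fedavg(clients, x, lr, K, rounds):
    """Plain FedAvg: K local gradient steps per client, then average."""
    for _ in range(rounds):
        finals = []
        for a, b in clients:
            y = x
            for _ in range(K):
                y -= lr * grad(a, b, y)  # drifts towards the client optimum b
            finals.append(y)
        x = sum(finals) / len(finals)    # server step with global step size 1
    return x


def scaffold(clients, x, lr, K, rounds):
    """Local steps corrected by (c - c_i), as in the Figure 2 caption."""
    n = len(clients)
    c = 0.0                # server control variate
    c_i = [0.0] * n        # per-client control variates
    for _ in range(rounds):
        dy, dc = [], []
        for i, (a, b) in enumerate(clients):
            y = x
            for _ in range(K):
                y -= lr * (grad(a, b, y) - c_i[i] + c)
            # Assumed refresh rule: average corrected gradient over the K steps.
            c_new = c_i[i] - c + (x - y) / (K * lr)
            dy.append(y - x)
            dc.append(c_new - c_i[i])
            c_i[i] = c_new
        x += sum(dy) / n   # server model update
        c += sum(dc) / n   # server control variate update
    return x


# Two heterogeneous clients: different curvatures a_i and different optima b_i.
clients = [(1.0, -1.0), (3.0, 1.0)]
x_star = sum(a * b for a, b in clients) / sum(a for a, _ in clients)

x_fedavg = fedavg(clients, x=0.0, lr=0.1, K=5, rounds=100)
x_scaffold = scaffold(clients, x=0.0, lr=0.1, K=5, rounds=100)

print("true optimum:", x_star)
print("FedAvg      :", x_fedavg)
print("SCAFFOLD    :", x_scaffold)
```

In this setup the true optimum is 0.5 while the average of the client optima is 0. FedAvg settles at about 0.34, pulled towards the average of the client optima, whereas the corrected updates converge to the true optimum.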

Theorems & Definitions (57)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Lemma 1: linear convergence rate
  • proof
  • Lemma 2: sub-linear convergence rate
  • proof
  • Lemma 3: relaxed triangle inequality
  • proof
  • ...and 47 more