On the Convergence of Local Descent Methods in Federated Learning

Farzin Haddadpour, Mehrdad Mahdavi

TL;DR

The paper addresses the convergence of local descent methods with periodic averaging in federated learning when data is heterogeneous across devices. It develops the Local Federated Descent framework and analyzes three instantiations (LFGD, LFSGD, and NFSGD) under general nonconvex objectives and under the Polyak-Łojasiewicz condition, deriving sharp convergence rates that scale favorably with the number of participating devices and with the network structure. A key concept is the gradient diversity Λ, which governs how hyperparameters should be chosen to achieve linear speedups and to control residual error, including results where no explicit variance reduction is required. The results extend to decentralized networks and provide guidance for fast, communication-efficient federated optimization in realistic non-i.i.d. settings, with avenues for future work such as adaptive synchronization and mechanisms for reducing gradient diversity.

Abstract

In federated distributed learning, the goal is to optimize a global training objective defined over distributed devices, where the data shard at each device is sampled from a possibly different distribution (a.k.a. heterogeneous or non-i.i.d. data samples). In this paper, we generalize local stochastic and full gradient descent with periodic averaging, originally designed for homogeneous distributed optimization, to solve nonconvex optimization problems in federated learning. Although scant research is available on the effectiveness of local SGD in reducing the number of communication rounds in the homogeneous setting, its convergence and communication complexity in the heterogeneous setting is mostly demonstrated empirically and lacks thorough theoretical understanding. To bridge this gap, we demonstrate that by properly analyzing the effect of unbiased gradients and the sampling scheme in the federated setting, under mild assumptions, the implicit variance reduction feature of local distributed methods generalizes to heterogeneous data shards and exhibits the best known convergence rates of the homogeneous setting, both for general nonconvex objectives and under the Polyak-Łojasiewicz (PL) condition (a generalization of strong convexity). Our theoretical results complement the recent empirical studies that demonstrate the applicability of local GD/SGD to federated learning. We also specialize the proposed local method for networked distributed optimization. To the best of our knowledge, the obtained convergence rates are the sharpest known to date on the convergence of local descent methods with periodic averaging for solving nonconvex federated optimization in both centralized and networked distributed optimization.
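As a rough illustration of local descent with periodic averaging as described in the abstract, the following is a minimal sketch rather than the paper's algorithm. It assumes full device participation in every round, scalar quadratic local objectives $f_j(w) = \frac{1}{2} a_j (w - b_j)^2$, and fixed averaging weights $q_j$ that sum to one. The function name and all parameter names are hypothetical.

```python
def local_gd_periodic_averaging(curvatures, targets, weights, w0, lr,
                                local_steps, rounds):
    """Toy local gradient descent with periodic weighted averaging.

    Assumption (not from the paper): device j holds the scalar objective
    f_j(w) = 0.5 * a_j * (w - b_j)**2, so its gradient is a_j * (w - b_j).
    Different (a_j, b_j) per device stand in for heterogeneous data.
    """
    w_global = w0
    for _ in range(rounds):
        local_models = []
        for a, b in zip(curvatures, targets):
            w = w_global  # every device starts the round from the shared model
            for _ in range(local_steps):
                w -= lr * a * (w - b)  # one full-gradient step on the local objective
            local_models.append(w)
        # Communication step: replace the shared model by the weighted average.
        w_global = sum(q * w for q, w in zip(weights, local_models))
    return w_global


# Three devices with equal curvature but different minimizers (heterogeneous data).
w_final = local_gd_periodic_averaging(
    curvatures=[1.0, 1.0, 1.0],
    targets=[0.0, 3.0, 6.0],
    weights=[0.5, 0.25, 0.25],
    w0=10.0, lr=0.1, local_steps=5, rounds=50,
)
```

In this toy setting with equal curvatures, the averaged model follows plain gradient descent on the weighted global objective, so it approaches the weighted mean of the targets. Increasing `local_steps` reduces the number of averaging rounds needed for the same number of gradient steps, which is the communication saving the abstract refers to.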

Paper Structure

This paper contains 41 sections, 28 theorems, 171 equations, 1 table, 2 algorithms.

Key Result

Theorem 3.1

For LFGD$(E, K, \boldsymbol{q})$ with $E$ local updates, under Assumptions Ass:1 and Ass:3, if we choose the learning rate $\eta$ and local updates $E$ such that holds, where $\lambda$ is an upper bound on the weighted gradient diversity, i.e., $\Lambda(\boldsymbol{w}, \boldsymbol{q}) \leq \lambda$, and all local models are initialized at the same point $\bar{\boldsymbol{w}}^{(0)}$, after $T$ it

Theorems & Definitions (44)

  • Definition 1: Weighted Gradient Diversity
  • Theorem 3.1: Informal
  • Remark 1
  • Remark 2
  • Remark 3
  • Theorem 3.2: Informal
  • Remark 4
  • Theorem 3.3: Informal
  • Remark 5
  • Definition 2
  • ...and 34 more
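
The weighted gradient diversity of Definition 1, which Theorem 3.1 bounds by $\lambda$, is not written out in this summary. The sketch below assumes the ratio form $\Lambda(\boldsymbol{w}, \boldsymbol{q}) = \sum_j q_j \|\nabla f_j(\boldsymbol{w})\|^2 \,/\, \|\sum_j q_j \nabla f_j(\boldsymbol{w})\|^2$; this form should be checked against the paper before it is relied on. The function name is hypothetical, and local gradients are passed in as plain lists of floats.

```python
def weighted_gradient_diversity(local_grads, weights):
    """Ratio of the weighted mean of squared local gradient norms to the
    squared norm of the weighted mean gradient.

    Assumption: this ratio form is taken as the definition of Lambda(w, q);
    the summary itself does not reproduce Definition 1.
    """
    dim = len(local_grads[0])
    # Numerator: sum_j q_j * ||grad_j||^2
    numerator = sum(q * sum(x * x for x in g)
                    for q, g in zip(weights, local_grads))
    # Weighted average gradient: sum_j q_j * grad_j
    avg = [sum(q * g[i] for q, g in zip(weights, local_grads))
           for i in range(dim)]
    denominator = sum(x * x for x in avg)
    if denominator == 0.0:
        raise ValueError("weighted average gradient is zero; ratio is undefined")
    return numerator / denominator


# Identical local gradients (homogeneous case) versus orthogonal ones.
same = weighted_gradient_diversity([[1.0, 2.0], [1.0, 2.0]], [0.5, 0.5])
orthogonal = weighted_gradient_diversity([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

Under this assumed form, and with weights that sum to one, the ratio is at least 1 by convexity of the squared norm. It equals 1 when all local gradients coincide and grows as they point in more different directions, which is why a bound $\lambda$ on it constrains the choice of $\eta$ and $E$.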