First Analysis of Local GD on Heterogeneous Data

Ahmed Khaled, Konstantin Mishchenko, Peter Richtárik

TL;DR

It is shown that in a low accuracy regime, the local gradient descent method has the same communication complexity as gradient descent.

Abstract

We provide the first convergence analysis of local gradient descent for minimizing the average of smooth and convex but otherwise arbitrary functions. Problems of this form and local gradient descent as a solution method are of importance in federated learning, where each function is based on private data stored by a user on a mobile device, and the data of different users can be arbitrarily heterogeneous. We show that in a low accuracy regime, the method has the same communication complexity as gradient descent.
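The method described in the abstract can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes two one-dimensional quadratics $f_m(x) = \tfrac{1}{2} a_m (x - c_m)^2$ with different curvatures standing in for heterogeneous user data, and the names `local_gd`, `make_quadratic_grad`, `curvatures`, and `centers` are hypothetical. Each worker runs a fixed number of local gradient steps from the shared point, then the iterates are averaged, which counts as one communication round.

```python
def local_gd(grads, x0, gamma, num_rounds, local_steps):
    """Local GD sketch: every worker takes `local_steps` gradient steps
    from the shared point, then all local iterates are averaged."""
    x_avg = float(x0)
    for _ in range(num_rounds):          # one iteration = one communication round
        local_iterates = []
        for grad in grads:               # each worker starts from the average
            x = x_avg
            for _ in range(local_steps):
                x = x - gamma * grad(x)  # plain gradient step on the local function
            local_iterates.append(x)
        x_avg = sum(local_iterates) / len(local_iterates)
    return x_avg


def make_quadratic_grad(a, c):
    """Gradient of the hypothetical local function 0.5 * a * (x - c)**2."""
    return lambda x: a * (x - c)


# Assumed toy data: different curvatures make the local functions heterogeneous.
curvatures = [1.0, 3.0]
centers = [0.0, 2.0]
grads = [make_quadratic_grad(a, c) for a, c in zip(curvatures, centers)]

# Minimizer of the average function: weighted mean of the centers.
x_star = sum(a * c for a, c in zip(curvatures, centers)) / sum(curvatures)

# Step size chosen below 1 / (4 L), with L = max curvature = 3.
x_sync = local_gd(grads, x0=0.0, gamma=0.05, num_rounds=500, local_steps=1)
x_local = local_gd(grads, x0=0.0, gamma=0.05, num_rounds=500, local_steps=5)

print("optimum:", x_star)
print("1 local step:", x_sync)
print("5 local steps:", x_local)
```

With one local step the loop reduces to gradient descent on the average function. With several local steps on this toy problem the iterates settle at a nearby point rather than at the exact optimum, which matches the limited-accuracy behavior the paper describes.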

Paper Structure

This paper contains 9 sections, 6 theorems, 44 equations, 2 figures, 1 algorithm.

Key Result

Lemma 1

Under the smoothness and convexity assumption and for any $\gamma \geq 0$, a one-step bound on ${\left\lVert r_{t+1}\right\rVert}^2$ holds, where $r_{t} \overset{\text{def}}{=} \hat{x}_t - x_\ast$. In particular, if $\gamma \leq \frac{1}{4L}$, then ${\left\lVert r_{t+1}\right\rVert}^2 \leq {\left\lVert r_t\right\rVert}^2 + \tfrac{3}{2} \gamma L V_t - \gamma D_{f} (\hat{x}_t, x_\ast).$
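To see how a one-step bound of this kind is typically used, the inequality for $\gamma \leq \frac{1}{4L}$ can be rearranged and summed over $t = 0, \dots, T-1$. This is a standard telescoping sketch, offered as an illustration rather than as the authors' exact argument; the horizon $T$ is a symbol introduced here.

```latex
% Rearrange the one-step bound and sum over t = 0, ..., T-1.
\begin{align*}
\gamma D_f(\hat{x}_t, x_\ast)
  &\leq \lVert r_t \rVert^2 - \lVert r_{t+1} \rVert^2 + \tfrac{3}{2} \gamma L V_t, \\
\gamma \sum_{t=0}^{T-1} D_f(\hat{x}_t, x_\ast)
  &\leq \lVert r_0 \rVert^2 - \lVert r_T \rVert^2 + \tfrac{3}{2} \gamma L \sum_{t=0}^{T-1} V_t
  \leq \lVert r_0 \rVert^2 + \tfrac{3}{2} \gamma L \sum_{t=0}^{T-1} V_t, \\
\frac{1}{T} \sum_{t=0}^{T-1} D_f(\hat{x}_t, x_\ast)
  &\leq \frac{\lVert r_0 \rVert^2}{\gamma T} + \frac{3L}{2} \cdot \frac{1}{T} \sum_{t=0}^{T-1} V_t.
\end{align*}
```

The first term decays at the usual $1/(\gamma T)$ rate, so the quality of the final guarantee depends on how well the terms $V_t$ can be controlled.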

Figures (2)

  • Figure 1: Convergence of local GD with different numbers of local steps on the 'a5a' dataset. One local step corresponds to fully synchronized gradient descent, and it is the only setting that converges precisely to the optimum. The left plot shows convergence in terms of communication rounds, with a clear advantage for local GD when only limited accuracy is required. The middle plot, however, shows that wall-clock time might improve only slightly, and the right plot shows how the picture changes with a different communication cost.
  • Figure 2: Same experiment as in Figure 1, performed on the 'mushrooms' dataset.

Theorems & Definitions (11)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Corollary 1
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • proof: Proof of Lemma (lemma:optimality-gap-recursion)
  • proof: Proof of Lemma (lemma:Vt-bound)
  • ...and 1 more