On the Limits of Momentum in Decentralized and Federated Optimization

Riccardo Zaccone, Sai Praneeth Karimireddy, Carlo Masone

TL;DR

This work analyzes momentum under cyclic client participation and proves that it remains inevitably affected by statistical heterogeneity. It also proves that decreasing step-sizes do not help: any schedule decreasing faster than $\Theta\left(1/t\right)$ leads to convergence to a constant value that depends on the initialization and the heterogeneity bound.

Abstract

Recent works have explored the use of momentum in local methods to enhance distributed SGD. This is particularly appealing in Federated Learning (FL), where momentum intuitively appears as a solution to mitigate the effects of statistical heterogeneity. Despite recent progress in this direction, it is still unclear if momentum can guarantee convergence under unbounded heterogeneity in decentralized scenarios, where only some workers participate at each round. In this work we analyze momentum under cyclic client participation, and theoretically prove that it remains inevitably affected by statistical heterogeneity. Similarly to SGD, we prove that decreasing step-sizes do not help either: in fact, any schedule decreasing faster than $\Theta\left(1/t\right)$ leads to convergence to a constant value that depends on the initialization and the heterogeneity bound. Numerical results corroborate the theory, and deep learning experiments confirm its relevance for realistic settings.

Paper Structure

This paper contains 41 sections, 15 theorems, 117 equations, 1 figure, 2 tables.

Key Result

Lemma 3.4

For any positive constants $G, \mu$, define $\mu$-strongly convex functions $f_1(\theta) := \frac{\mu}{2} \theta^2 + G\theta$ and $f_2(\theta) := \frac{\mu}{2} \theta^2 - G\theta$ satisfying the bounded gradient dissimilarity assumption (assum:bounded_gd) and such that $f(\theta)=\frac{1}{2}\left(f_1(\theta) + f_2(\theta)\right)$. Under where, given algorithm-dependent coefficients $p_t^{(a)}, q_t^{(a)}, r_t^{(a)}$:

Figures (1)

  • Figure 1: FedAvg and FedCM under cyclic participation: under high heterogeneity and partial participation, FL methods based on classical momentum do not offer a substantial improvement over simpler methods without momentum. Results on CIFAR-10 with ResNet-20 (left) and CNN (right). The reference accuracy in centralized settings is $\approx 86\%$ for CNN and $\approx 89\%$ for ResNet-20.
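
The effect of cyclic participation can be reproduced on the two one-dimensional clients of Lemma 3.4, $f_1(\theta) = \frac{\mu}{2}\theta^2 + G\theta$ and $f_2(\theta) = \frac{\mu}{2}\theta^2 - G\theta$, whose average has its minimum at $\theta = 0$. The following is a minimal sketch, not the paper's FedAvgM or FedCM: it assumes one gradient step per round, a constant step size, and a plain heavy-ball momentum buffer. The function name `run_cyclic` and the default values of `mu`, `eta`, `beta`, `rounds`, and `theta0` are illustrative choices.

```python
def run_cyclic(G, mu=1.0, eta=0.1, beta=0.0, rounds=2000, theta0=1.0):
    """Heavy-ball gradient descent on two 1-D clients visited in alternation.

    Client 1 has gradient mu*theta + G, client 2 has gradient mu*theta - G.
    Returns the distance of the last iterate from the global optimum theta = 0.
    Assumption: one gradient step per round, constant step size eta.
    """
    theta = theta0
    momentum = 0.0
    for t in range(rounds):
        # Cyclic participation: even rounds use client 1, odd rounds client 2.
        sign = 1.0 if t % 2 == 0 else -1.0
        grad = mu * theta + sign * G
        # Heavy-ball update; beta = 0 recovers plain gradient descent.
        momentum = beta * momentum + grad
        theta = theta - eta * momentum
    return abs(theta)


# Compare no momentum (beta = 0) with momentum (beta = 0.9)
# for increasing heterogeneity G.
for G in (0.0, 1.0, 10.0):
    plain = run_cyclic(G, beta=0.0)
    heavy = run_cyclic(G, beta=0.9)
    print(f"G={G:5.1f}  |theta| without momentum={plain:.4f}  with momentum={heavy:.4f}")
```

In this toy setting the last iterate reaches the optimum only when $G = 0$. For $G > 0$ it settles at a distance from the optimum that grows linearly with $G$, both with and without momentum.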

Theorems & Definitions (30)

  • Lemma 3.4: Behavior of FedAvgM and FedCM on two one-dimensional clients
  • proof : Proof sketch
  • Theorem 3.5
  • proof : Proof sketch
  • Theorem 3.6
  • proof : Proof sketch
  • Lemma B.1
  • proof
  • Lemma B.2
  • proof
  • ...and 20 more