Federated Optimization with Doubly Regularized Drift Correction

Xiaowen Jiang, Anton Rodomanov, Sebastian U. Stich

TL;DR

This work revisits DANE, an established method in distributed optimization, and shows that it achieves the desired communication reduction under Hessian similarity constraints. It also presents an extension, DANE+, which supports arbitrary inexact local solvers and allows more freedom in how the local updates are aggregated.

Abstract

Federated learning is a distributed optimization paradigm that allows training machine learning models across decentralized devices while keeping the data localized. The standard method, FedAvg, suffers from client drift, which can hamper performance and increase communication costs relative to centralized methods. Previous works proposed various strategies to mitigate drift, yet none have shown uniformly improved communication-computation trade-offs over vanilla gradient descent. In this work, we revisit DANE, an established method in distributed optimization. We show that (i) DANE can achieve the desired communication reduction under Hessian similarity constraints. Furthermore, (ii) we present an extension, DANE+, which supports arbitrary inexact local solvers and has more freedom to choose how to aggregate the local updates. Finally, (iii) we propose a novel method, FedRed, which has improved local computational complexity while retaining the same communication complexity as DANE/DANE+. This is achieved by using doubly regularized drift correction.
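To make the idea of drift correction concrete, here is a minimal sketch of a DANE-style round on synthetic quadratic clients. It is not the authors' implementation: the helper names (`make_clients`, `dane_round`, `run_dane`), the regularization parameter `lam`, the exact local solve, and plain averaging as the aggregation rule are all assumptions chosen for illustration. Each client minimizes its own loss plus a linear correction term built from the gap between the global and local gradients, plus a proximal term that keeps the local solution near the current global model.

```python
import numpy as np


def make_clients(n=4, d=5, noise=0.05, seed=0):
    """Build n quadratic clients f_i(x) = 0.5 x^T A_i x - b_i^T x with similar Hessians."""
    rng = np.random.default_rng(seed)
    base = np.diag(np.linspace(1.0, 10.0, d))  # shared curvature
    perturb = []
    for _ in range(n):
        g = rng.standard_normal((d, d))
        perturb.append(noise * (g + g.T) / 2.0)  # small symmetric perturbation
    mean_p = sum(perturb) / n
    A = [base + p - mean_p for p in perturb]  # perturbations average to zero
    b = [rng.standard_normal(d) for _ in range(n)]
    return A, b


def dane_round(x, A, b, lam):
    """One communication round: exact local solves, then averaging (assumed aggregation)."""
    grads = [Ai @ x - bi for Ai, bi in zip(A, b)]  # local gradients at x
    g = sum(grads) / len(grads)                    # global gradient at x
    d = x.shape[0]
    local = []
    for Ai, bi, gi in zip(A, b, grads):
        # Local subproblem: min_y f_i(y) + <g - g_i, y> + (lam / 2) * ||y - x||^2.
        # For a quadratic f_i the optimality condition is a linear system.
        rhs = bi - (g - gi) + lam * x
        local.append(np.linalg.solve(Ai + lam * np.eye(d), rhs))
    return sum(local) / len(local)


def run_dane(A, b, lam=1.0, rounds=60):
    """Run several rounds starting from the zero vector."""
    x = np.zeros(A[0].shape[0])
    for _ in range(rounds):
        x = dane_round(x, A, b, lam)
    return x
```

Calling `run_dane(*make_clients())` and comparing the result with the minimizer of the averaged quadratic shows the iterates approaching the global solution even though every client only ever solves a problem built from its own data. The correction term `g - gi` is what removes the pull toward each client's local minimizer.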

Paper Structure (24 sections, 36 theorems, 201 equations, 3 figures, 1 table, 1 algorithm)

Key Result

Theorem 1

Consider Algorithm Alg:FrameworkDeterministic with control variate eq:ControlVariate2. Let the global model be updated by choosing an arbitrary local model for each communication round. Let $f_i : \mathbb{R}^d \to \mathbb{R}$ be continuously differentiable for any $i \in [n]$. Assume that $\{f_i\}$ ... where $\bar{\mathbf{x}}^R = \arg\min_{\mathbf{x}\in\{\mathbf{x}^r\}_{r=0}^R} \{\|\nabla f(\mathbf{x})\|\}$ ...

Figures (3)

  • Figure 1: Illustrating communication reduction for DANE+-GD and FedRed-GD on a synthetic dataset using quadratic loss with $\frac{L}{\delta_A}\approx\frac{L}{\delta_B}\gtrsim 20$. DANE+-GD and FedRed-GD require roughly $20$ times fewer communication rounds than GD to reach the same suboptimality, while the total number of local computations of FedRed-GD is on the same scale as that of GD. (Repeated 3 times for FedRed-GD. The solid lines show the mean and the shaded area shows the region between the minimum and maximum values.)
  • Figure 2: Comparison of DANE+-GD and FedRed-GD against four other distributed optimizers on four LIBSVM datasets using regularized logistic loss.
  • Figure 3: Comparison of DANE+-SGD and FedRed-SGD against three other distributed optimizers on multi-class classification tasks with CIFAR10 and CIFAR100 datasets using ResNet18 with softmax loss. FedProx (without drift correction) exhibits faster and more stable convergence compared to the other methods.

Theorems & Definitions (65)

  • Definition 1: Bounded Hessian Dissimilarity [scaffold]
  • Definition 2: Averaged Hessian Dissimilarity [svrp]
  • Theorem 1
  • Corollary 2
  • Theorem 3
  • Theorem 4
  • Corollary 5
  • Theorem 6
  • Theorem 7
  • Theorem 8
  • ...and 55 more
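
For orientation, the two dissimilarity notions named in Definitions 1 and 2 are commonly stated as follows. This is a sketch of the usual formulation, not a quotation of the paper's definitions. The shorthand $h_i := f - f_i$ and the exact quantifiers are assumptions here, while $\delta_A$ and $\delta_B$ follow the notation in the Figure 1 caption.

$$
\text{(Bounded)}\qquad \|\nabla h_i(\mathbf{x}) - \nabla h_i(\mathbf{y})\| \le \delta_B \,\|\mathbf{x} - \mathbf{y}\| \quad \text{for all } \mathbf{x}, \mathbf{y} \in \mathbb{R}^d \text{ and all } i \in [n],
$$

$$
\text{(Averaged)}\qquad \frac{1}{n}\sum_{i=1}^{n} \|\nabla h_i(\mathbf{x}) - \nabla h_i(\mathbf{y})\|^2 \le \delta_A^2 \,\|\mathbf{x} - \mathbf{y}\|^2 \quad \text{for all } \mathbf{x}, \mathbf{y} \in \mathbb{R}^d.
$$

Under this formulation the averaged condition is the weaker one: squaring the bounded condition and averaging over $i$ gives $\delta_A \le \delta_B$.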