Adaptive Federated Optimization

Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, H. Brendan McMahan

TL;DR

The paper introduces FedOpt, a framework for adaptive server optimization in federated learning (FL): the server runs Adagrad, Adam, or Yogi while clients perform SGD. It provides theoretical convergence guarantees in nonconvex settings and demonstrates, through a large benchmark suite, that adaptive server optimizers improve convergence and ease tuning in cross-device FL. The work also presents extensive empirical comparisons against FedAvg, FedAvgM, and SCAFFOLD across diverse datasets and tasks, highlighting the practical benefits of server-side adaptivity. The authors release open-source implementations and a reproducible evaluation framework to advance federated optimization research.

Abstract

Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam, and Yogi, and analyze their convergence in the presence of heterogeneous data for general non-convex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.
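
To make the server/client split concrete, here is a minimal NumPy sketch of one way the adaptive server step could look. It is an illustration under stated assumptions, not the authors' released implementation. The update rules are the standard Adagrad, Adam, and Yogi forms applied to the averaged client delta treated as a pseudo-gradient. The names `client_delta`, `server_update`, `run_fedopt`, and `CENTERS`, the momentum parameters `beta1` and `beta2`, the initialization of `v` to `tau ** 2`, the toy quadratic client objectives, and full client participation in every round are all assumptions made for this example. The symbols `eta` (server learning rate), `eta_l` (client learning rate), and `tau` (adaptivity parameter) follow the notation used in the figure captions.

```python
import numpy as np

# Hypothetical toy data: client i minimizes 0.5 * ||x - CENTERS[i]||^2,
# so the global optimum is the mean of the centers, [1.0, -0.5].
CENTERS = np.array([[2.0, -1.0], [0.0, 0.0], [1.0, -0.5]])


def client_delta(x, center, eta_l, local_steps):
    """Run local SGD from the server model x and return the model delta."""
    y = x.copy()
    for _ in range(local_steps):
        y -= eta_l * (y - center)  # gradient of the toy quadratic loss
    return y - x


def server_update(x, m, v, delta, optimizer, eta, tau, beta1=0.9, beta2=0.99):
    """Apply one adaptive server step using the averaged client delta."""
    m = beta1 * m + (1.0 - beta1) * delta  # first-moment (momentum) estimate
    sq = delta ** 2
    if optimizer == "adagrad":
        v = v + sq  # accumulate all squared deltas
    elif optimizer == "adam":
        v = beta2 * v + (1.0 - beta2) * sq  # exponential moving average
    elif optimizer == "yogi":
        v = v - (1.0 - beta2) * sq * np.sign(v - sq)  # additive Yogi update
    else:
        raise ValueError(f"unknown optimizer: {optimizer}")
    # The delta already points in a descent direction, so it is added.
    x = x + eta * m / (np.sqrt(v) + tau)
    return x, m, v


def run_fedopt(optimizer, rounds=300, eta=0.1, eta_l=0.1, tau=1e-3, local_steps=5):
    """Train on the toy problem with every client participating each round."""
    x = np.zeros(2)
    m = np.zeros(2)
    v = np.full(2, tau ** 2)  # assumed initialization of the second moment
    for _ in range(rounds):
        deltas = [client_delta(x, c, eta_l, local_steps) for c in CENTERS]
        delta = np.mean(deltas, axis=0)
        x, m, v = server_update(x, m, v, delta, optimizer, eta, tau)
    return x


if __name__ == "__main__":
    for name in ("adagrad", "adam", "yogi"):
        print(name, run_fedopt(name))
```

Only the second-moment update differs between the three variants, which is why a single server routine can cover all of them. A cross-device deployment would sample a subset of clients per round rather than use all of them as this toy loop does.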

Paper Structure

This paper contains 46 sections, 9 theorems, 73 equations, 14 figures, 11 tables, 8 algorithms.

Key Result

Theorem 1

Let Assumptions asp:lipschitz to asp:bounded-grad hold, and let $L, G, \sigma_l, \sigma_g$ be as defined therein. Let $\sigma^2 = \sigma_l^2 + 6K\sigma_g^2$. Consider the following conditions for $\eta_l$: [...] Then the iterates of Algorithm alg:fall for FedAdagrad satisfy [...] Here, we define [...]

Figures (14)

  • Figure 1: Validation accuracy of adaptive and non-adaptive methods, as well as SCAFFOLD, using constant learning rates $\eta$ and $\eta_l$ tuned to achieve the best training performance over the last 100 communication rounds; see the appendix for the learning rate grids.
  • Figure 2: Validation accuracy (averaged over the last 100 rounds) of FedAdam, FedYogi, and FedAvgM for various client/server learning rate combinations on the SO NWP task. For FedAdam and FedYogi, we set $\tau = 10^{-3}$.
  • Figure 3: Validation performance of FedAdagrad, FedAdam, and FedYogi for varying $\tau$ on various tasks. The learning rates $\eta$ and $\eta_l$ are tuned for each $\tau$ to achieve the best training performance on the last 100 communication rounds.
  • Figure 4: Validation accuracy on EMNIST CR using constant learning rates $\eta$, $\eta_l$, and $\tau$ tuned to achieve the best training performance on the last 100 communication rounds; see the appendix for the hyperparameter grids.
  • Figure 5: Validation accuracy (averaged over the last 100 rounds) of FedAdagrad, FedAdam, FedYogi, FedAvgM, and FedAvg for various client/server learning rate combinations on the CIFAR-10 task. For FedAdagrad, FedAdam, and FedYogi, we set $\tau = 10^{-3}$.
  • ...and 9 more figures

Theorems & Definitions (15)

  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Remark 1
  • proof: Proof of Theorem 1 (thm:fadagrad_conv)
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 5 more