Federated Learning Based on Dynamic Regularization

Durmus Alp Emre Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina, Paul N. Whatmough, Venkatesh Saligrama

TL;DR

The paper tackles communication bottlenecks in federated learning with heterogeneous devices by shifting substantial computation to devices and introducing a dynamic regularizer that aligns local device minima with the global empirical loss. It introduces the FedDyn algorithm, which updates a per-round regularizer and propagates a server-side correction to ensure device optima converge to global stationary points, and provides convergence guarantees in convex, strongly convex, and nonconvex settings with rates that scale favorably with the participation ratio m/P. Empirical results across MNIST, EMNIST-L, CIFAR-10/100, and Shakespeare show substantial communication savings and robustness to partial participation, non-IID data, and large device counts compared to FedAvg, FedProx, and SCAFFOLD. The work also introduces a lite variant, FedDynOneGD, demonstrating that similar asymptotic performance can be achieved with reduced per-round computation, making the approach practical for large-scale FL deployments.
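To make the per-round regularizer and the server-side correction concrete, here is a minimal sketch on a toy problem. It is not the authors' implementation. It assumes the FedDyn update rules as they are commonly stated (a linear term built from each device's previous local gradient, a quadratic penalty with weight `alpha` toward the current server model, and a server state `h` that accumulates device drift), none of which are spelled out on this page. The function name `feddyn_quadratic`, the quadratic device losses $L_k(\theta) = \tfrac{1}{2}\|\theta - c_k\|^2$, and all parameter names are hypothetical choices made so that the local minimization has a closed form.

```python
import random


def feddyn_quadratic(centers, alpha=0.5, rounds=100, n_active=None, seed=0):
    """Toy FedDyn-style loop for device losses L_k(theta) = 0.5 * ||theta - c_k||^2.

    Assumed update rules (hypothetical sketch, not taken from the page):
      device:  theta_k = argmin L_k(th) - <g_k, th> + alpha/2 * ||th - theta||^2
               g_k    <- g_k - alpha * (theta_k - theta)
      server:  h      <- h - (alpha / m) * sum over active devices of (theta_k - theta)
               theta  <- mean of active theta_k - h / alpha
    """
    rng = random.Random(seed)
    m, d = len(centers), len(centers[0])
    n_active = m if n_active is None else n_active

    theta = [0.0] * d                      # server model
    grads = [[0.0] * d for _ in range(m)]  # per-device linear-term state g_k
    h = [0.0] * d                          # server-side correction state

    for _ in range(rounds):
        active = rng.sample(range(m), n_active)  # devices participating this round
        local = []
        for k in active:
            # Closed-form minimizer of the regularized quadratic device objective.
            th_k = [(centers[k][j] + grads[k][j] + alpha * theta[j]) / (1.0 + alpha)
                    for j in range(d)]
            # Update the device's linear term using the drift from the server model.
            grads[k] = [grads[k][j] - alpha * (th_k[j] - theta[j]) for j in range(d)]
            local.append(th_k)

        # Server accumulates the drift of active devices, then corrects the average.
        drift = [sum(th[j] - theta[j] for th in local) for j in range(d)]
        h = [h[j] - (alpha / m) * drift[j] for j in range(d)]
        theta = [sum(th[j] for th in local) / n_active - h[j] / alpha
                 for j in range(d)]
    return theta
```

With full participation, calling `feddyn_quadratic([[0.0, 0.0], [2.0, 4.0], [4.0, -1.0]])` should drive the server model toward the average of the three centers, which is the minimizer of the global loss in this toy setting even though each device's own minimum is different. Passing a smaller `n_active` exercises partial participation under the same assumed rules.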

Abstract

We propose a novel federated learning method for distributively training neural network models, where the server orchestrates cooperation between a subset of randomly chosen devices in each round. We view the federated learning problem primarily from a communication perspective and allow more device-level computation to save transmission costs. We point out a fundamental dilemma: the minima of the device-level empirical losses are inconsistent with those of the global empirical loss. Unlike recent prior works, which either attempt inexact minimization or utilize devices for parallelizing gradient computation, we propose a dynamic regularizer for each device at each round, so that in the limit the global and device solutions are aligned. We demonstrate, through empirical results on real and synthetic data as well as analytical results, that our scheme leads to efficient training in both convex and non-convex settings, while being fully agnostic to device heterogeneity and robust to a large number of devices, partial participation, and unbalanced data.

Paper Structure

This paper contains 23 sections, 26 theorems, 76 equations, 13 figures, 6 tables, 2 algorithms.

Key Result

Theorem 1

Assuming a constant number of devices are selected uniformly at random in each round, $|{\cal P}_t|=P$, for a suitably chosen $\alpha > 0$, the FedDyn algorithm satisfies, where ${\boldsymbol \gamma}^t{=}\frac{1}{P}\sum_{k \in {\cal P}_{t}}{\boldsymbol \theta}_k^{t},\ \ {\boldsymbol \theta}_*{=}\underset{{\boldsymbol \theta}}{\hbox{arg} \min}\ \ell({\boldsymbol \theta}),\ \ \ell_*{=}\

Figures (13)

  • Figure 1: CIFAR-10 - $\alpha$ sensitivity analysis of FedDyn.
  • Figure 2: CIFAR-10 - FedSplit and FedDyn comparison in full and $10\%$ participation settings.
  • Figure 3: MNIST - Histogram of the number of devices for which $40\%$, $60\%$, and $80\%$ of datapoints belong to $k$ classes.
  • Figure 4: CIFAR-10 - Convergence curves for $100$ and $1000$ devices in the IID and Dirichlet (.3) settings with $10\%$ participation level and balanced data.
  • Figure 5: CIFAR-100 - Convergence curves for $100$ and $500$ devices in the IID and Dirichlet (.3) settings with $10\%$ participation level and balanced data.
  • ...and 8 more figures

Theorems & Definitions (28)

  • Theorem 1
  • Definition 1
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • ...and 18 more