Federated Learning Based on Dynamic Regularization
Durmus Alp Emre Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina, Paul N. Whatmough, Venkatesh Saligrama
TL;DR
The paper tackles communication bottlenecks in federated learning with heterogeneous devices by shifting substantial computation to devices and introducing a dynamic regularizer that aligns local device minima with the global empirical loss. It introduces the FedDyn algorithm, which updates a per-round regularizer and propagates a server-side correction to ensure device optima converge to global stationary points, and provides convergence guarantees in convex, strongly convex, and nonconvex settings with rates that scale favorably with the participation ratio m/P. Empirical results across MNIST, EMNIST-L, CIFAR-10/100, and Shakespeare show substantial communication savings and robustness to partial participation, non-IID data, and large device counts compared to FedAvg, FedProx, and SCAFFOLD. The work also introduces a lite variant, FedDynOneGD, demonstrating that similar asymptotic performance can be achieved with reduced per-round computation, making the approach practical for large-scale FL deployments.
Abstract
We propose a novel federated learning method for distributively training neural network models, where the server orchestrates cooperation between a subset of randomly chosen devices in each round. We view the federated learning problem primarily from a communication perspective and allow more device-level computation to save transmission costs. We point out a fundamental dilemma: the minima of the device-level empirical losses are inconsistent with those of the global empirical loss. Unlike recent prior work, which either attempts inexact minimization or uses devices to parallelize gradient computation, we propose a dynamic regularizer for each device at each round, so that in the limit the global and device solutions are aligned. We demonstrate, through empirical results on real and synthetic data as well as analytical results, that our scheme leads to efficient training in both convex and non-convex settings, while being fully agnostic to device heterogeneity and robust to a large number of devices, partial participation, and unbalanced data.
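The abstract describes the dynamic regularizer only in words, so the following is a minimal sketch of how such a scheme can look, not the authors' implementation. It assumes the update rules as we understand them from the published FedDyn algorithm: each device minimizes its loss plus a linear term built from its stored gradient state and a quadratic penalty toward the server model, and the server averages the device models and subtracts a correction. The toy quadratic losses, the closed-form local minimization, and all names (`feddyn_quadratic`, `curv`, `centers`, `alpha`, `num_active`) are illustrative assumptions.

```python
import numpy as np

def feddyn_quadratic(curv, centers, alpha=1.0, rounds=200, num_active=None, seed=0):
    """Toy FedDyn-style loop (assumed update rules) on device losses
    L_k(x) = 0.5 * curv[k] * ||x - centers[k]||^2."""
    rng = np.random.default_rng(seed)
    m, d = centers.shape
    num_active = m if num_active is None else num_active  # default: all devices
    theta = np.zeros(d)        # server model
    h = np.zeros(d)            # server-side correction state
    grads = np.zeros((m, d))   # per-device gradient state (linear regularizer term)
    for _ in range(rounds):
        active = rng.choice(m, size=num_active, replace=False)
        local = np.empty((num_active, d))
        for i, k in enumerate(active):
            a, c = curv[k], centers[k]
            # Closed-form argmin over x of
            #   L_k(x) - <grads[k], x> + (alpha / 2) * ||x - theta||^2
            local[i] = (a * c + grads[k] + alpha * theta) / (a + alpha)
            # First-order condition: the new state equals grad L_k(local[i]).
            grads[k] -= alpha * (local[i] - theta)
        # Server: update the correction, then average and correct.
        h -= alpha * (local - theta).sum(axis=0) / m
        theta = local.mean(axis=0) - h / alpha
    return theta

# Three devices with different curvatures and minimizers (heterogeneous data).
curv = np.array([1.0, 2.0, 4.0])
centers = np.array([[0.0, 0.0], [3.0, -1.0], [-2.0, 5.0]])

# Minimizer of the global loss (average of the device losses).
global_min = (curv[:, None] * centers).sum(axis=0) / curv.sum()
# Plain average of the device minimizers: a different point in general.
naive_avg = centers.mean(axis=0)

theta = feddyn_quadratic(curv, centers)
print("global minimizer:", global_min)
print("FedDyn sketch   :", theta)
print("naive average   :", naive_avg)
```

In this toy setting the average of the device minimizers differs from the global minimizer, which is the inconsistency the abstract points out, while the regularized iteration with full participation settles on the global minimizer. Passing a smaller `num_active` samples a random subset of devices each round; the sketch makes no claim about how quickly that case converges.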
