Locally Adaptive Federated Learning

Sohom Mukherjee; Nicolas Loizou; Sebastian U. Stich

TL;DR

The paper addresses federated optimization under client heterogeneity by replacing a global constant stepsize with fully locally adaptive updates based on the stochastic Polyak stepsize (SPS). It introduces FedSPS, a fully client-side adaptive algorithm, and a decreasing-stepsize variant, FedDecSPS, which achieves exact convergence in non-interpolating regimes. Theoretical results show sublinear and linear convergence in the convex and strongly convex settings, respectively, with linear convergence under interpolation and exact convergence achievable with decreasing stepsizes in non-interpolating cases. Empirical results demonstrate that FedSPS matches or exceeds tuned FedAvg and FedAMS on both convex and non-convex tasks, while requiring less hyperparameter tuning and offering improved generalization.

Abstract

Federated learning is a paradigm of distributed machine learning in which multiple clients coordinate with a central server to learn a model, without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) ensure balance among the clients by using the same stepsize for local updates on all clients. However, this means that all clients need to respect the global geometry of the function, which could yield slow convergence. In this work, we propose locally adaptive federated learning algorithms that leverage the local geometric information for each client function. We show that such locally adaptive methods with uncoordinated stepsizes across all clients can be particularly efficient in interpolated (overparameterized) settings, and analyze their convergence in the presence of heterogeneous data for convex and strongly convex settings. We validate our theoretical claims by performing illustrative experiments for both i.i.d. and non-i.i.d. cases. Our proposed algorithms match the optimization performance of tuned FedAvg in the convex setting, outperform FedAvg as well as state-of-the-art adaptive federated algorithms like FedAMS for non-convex experiments, and come with superior generalization performance.
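
The idea of uncoordinated, locally adaptive stepsizes can be illustrated with a minimal sketch. It assumes the standard capped stochastic Polyak stepsize, min{(loss - lower bound) / (c * ||grad||^2), gamma_b}, and a FedAvg-style round of tau local steps followed by server averaging. The function names `sps_stepsize` and `local_sps_round`, the `clients` layout, and the default lower bound of 0 are hypothetical choices made for illustration, not the paper's reference implementation.

```python
import numpy as np

def sps_stepsize(loss, grad, lower_bound=0.0, c=0.5, gamma_b=1.0):
    """Capped Polyak-type stepsize computed from one client's own loss and gradient."""
    grad_sq = float(np.dot(grad, grad))
    if grad_sq == 0.0:
        return gamma_b  # zero gradient: the update is zero regardless of the stepsize
    return min((loss - lower_bound) / (c * grad_sq), gamma_b)

def local_sps_round(x, clients, tau=5, c=0.5, gamma_b=1.0):
    """One communication round: tau local steps per client, then server averaging.

    `clients` is a list of (loss_fn, grad_fn) pairs (an assumed, illustrative layout).
    """
    local_models = []
    for loss_fn, grad_fn in clients:
        x_i = np.array(x, dtype=float)  # each client starts from the server model
        for _ in range(tau):
            g = grad_fn(x_i)
            step = sps_stepsize(loss_fn(x_i), g, c=c, gamma_b=gamma_b)
            x_i = x_i - step * g  # stepsize differs per client and per iteration
        local_models.append(x_i)
    return np.mean(local_models, axis=0)  # server aggregates by simple averaging
```

Each client computes its stepsize only from quantities it already has locally, so no stepsize needs to be communicated or tuned globally; `gamma_b` acts as an upper cap.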

Paper Structure (46 sections, 17 theorems, 74 equations, 9 figures, 1 table, 3 algorithms)

Introduction
Additional related Work
Problem setup
Locally Adaptive Federated Optimization
Background and Motivation
Proposed Method
Convergence analysis of FedSPS
Assumptions on the objective function and noise
Convergence of fully locally adaptive FedSPS
Decreasing FedSPS for exact convergence
Experiments
Conclusion
Technical preliminaries
General definitions
General inequalities
...and 31 more sections

Key Result

Proposition 1

With $\sigma_f^2$, $\zeta_{\star}^2$, and $\sigma_{\star}^2$ as defined above, we have: (a) $\zeta_{\star}^2 \leq 2 L \sigma_f^2$, and (b) $\sigma_{\star}^2 \leq 2 L \sigma_f^2$.
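
A sketch of why part (a) is plausible, under assumed definitions that this summary does not reproduce and that may differ from the paper's exact ones: suppose each $f_i$ is $L$-smooth with a known lower bound $\ell_i^{\star} \leq \inf_x f_i(x)$, and take $\sigma_f^2 := \frac{1}{n}\sum_{i=1}^n \left(f_i(x^{\star}) - \ell_i^{\star}\right)$ and $\zeta_{\star}^2 := \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^{\star})\|^2$. The standard smoothness inequality $\|\nabla f_i(x)\|^2 \leq 2L\left(f_i(x) - \inf_y f_i(y)\right)$ then gives:

$$
\zeta_{\star}^2
= \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^{\star})\|^2
\leq \frac{1}{n}\sum_{i=1}^n 2L\left(f_i(x^{\star}) - \inf_y f_i(y)\right)
\leq \frac{2L}{n}\sum_{i=1}^n \left(f_i(x^{\star}) - \ell_i^{\star}\right)
= 2L\,\sigma_f^2 .
$$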

Figures (9)

Figure 1: Illustration for Example 1, showing that local adaptivity can improve convergence. We run SGD with constant, global SPS, and locally adaptive SPS stepsizes (with $c=0.5, \gamma_b=1.0$), for functions $f_1(x) = x^2$, $f_2(x) = \frac{1}{2}x^2$, where stochastic noise was simulated by adding Gaussian noise with mean 0 and standard deviation 10 to the gradients.
Figure 2: Sensitivity analysis of FedSPS to hyperparameters for convex logistic regression on the MNIST dataset (i.i.d.) without client sampling. (a) Comparing the effect of varying $\gamma_b$ on FedSPS and varying $\gamma$ on FedAvg convergence: FedAvg is more sensitive to changes in $\gamma$, while FedSPS is insensitive to changes in $\gamma_b$. (b) Effect of varying $\gamma_b$ on FedSPS stepsize adaptivity: adaptivity is lost if $\gamma_b$ is chosen too small. (c) Small $c$ works well in practice ($\tau = 5$). (d) Optimal $c$ versus $\tau$, showing that there is no dependence.
Figure 3: Comparison for convex logistic regression. (a) MNIST dataset (i.i.d. without client sampling). (b) w8a dataset (i.i.d. with client sampling). (c) MNIST dataset (non-i.i.d. with client sampling). (d) Average stepsize across all clients for FedSPS and FedDecSPS corresponding to (c). Performance of FedSPS matches that of FedAvg with best tuned local learning rate for the i.i.d. cases, and outperforms in the non-i.i.d. case.
Figure 4: Non-convex MNIST experiments with client sampling. (a) Non-convex case of LeNet on MNIST dataset (i.i.d.). (b) Non-convex case of LeNet on MNIST dataset (non-i.i.d.). First column represents training loss, second column is test accuracy. Convergence of FedSPS is very close to that of FedAvg with the best possible tuned local learning rate. Moreover, FedSPS converges better than FedAMS for the non-convex MNIST case (both i.i.d. and non-i.i.d.), and also offers better generalization performance than FedAMS. FedSPS is referred to as FedSPS-Local in the legends, to distinguish it clearly from FedSPS-Global.
Figure 5: Non-convex CIFAR-10 experiments with client sampling. (a) Non-convex case of ResNet18 on CIFAR-10 dataset (i.i.d.). (b) Non-convex case of ResNet18 on CIFAR-10 dataset (non-i.i.d.). First column represents training loss, second column is test accuracy. FedSPS converges better than FedAvg and FedAMS for both i.i.d. and non-i.i.d. settings, and also offers superior generalization performance.
...and 4 more figures
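
The toy setting described in the Figure 1 caption can be approximated with a short sketch. It covers only the constant and locally adaptive SPS stepsizes; the "global SPS" baseline is omitted because its exact form is not given here. The function name `run_toy`, the averaging of one local step per client, the starting point, the constant stepsize `gamma=0.1`, and the use of 0 as each client's lower bound are all assumptions made for illustration.

```python
import numpy as np

def run_toy(stepsize_rule, steps=200, x0=5.0, noise_std=10.0,
            gamma=0.1, c=0.5, gamma_b=1.0, seed=0):
    """Average two clients' one-step SGD updates on f1(x) = x^2 and f2(x) = x^2 / 2."""
    rng = np.random.default_rng(seed)
    curvatures = (2.0, 1.0)  # f_i(x) = 0.5 * a_i * x^2, so a = 2 is f1 and a = 1 is f2
    x = x0
    for _ in range(steps):
        updates = []
        for a in curvatures:
            loss = 0.5 * a * x * x
            g = a * x + noise_std * rng.standard_normal()  # gradient plus Gaussian noise
            if stepsize_rule == "constant":
                step = gamma  # same stepsize on every client
            else:  # "local_sps": each client uses only its own loss and gradient
                gg = g * g
                step = min(loss / (c * gg), gamma_b) if gg > 0.0 else gamma_b
            updates.append(x - step * g)
        x = float(np.mean(updates))  # server averages the two local iterates
    return x
```

Calling `run_toy("constant")` and `run_toy("local_sps")` with the same seed gives a rough side-by-side comparison; setting `noise_std=0.0` makes the run deterministic, which is convenient for sanity checks.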

Theorems & Definitions (28)

Example 1: Local adaptivity using Polyak stepsizes can improve convergence
Remark 2: Alternative design choices
Proposition 1: Comparison of heterogeneity measures
Theorem 3: Convergence of FedSPS
Remark 4: Minimal need for hyperparameter tuning
Corollary 5: Linear Convergence of FedSPS under Interpolation
Theorem 6: Convergence of small stepsize FedSPS
Definition 1: Convexity
Definition 2: $L$-smooth
Lemma 7: Li and Orabona (2019), Lemma 4
...and 18 more
