Locally Adaptive Federated Learning

Sohom Mukherjee, Nicolas Loizou, Sebastian U. Stich

TL;DR

The paper addresses federated optimization under client heterogeneity by replacing global constant stepsizes with fully locally adaptive updates based on the stochastic Polyak stepsize (SPS). It introduces FedSPS, a fully client-side adaptive algorithm, and a decreasing-stepsize variant, FedDecSPS, that achieves exact convergence in non-interpolating regimes. Theoretical results show sublinear and linear convergence in convex and strongly convex settings, respectively, with linear convergence under interpolation and exact convergence achievable with decreasing stepsizes in non-interpolating cases. Empirical results demonstrate that FedSPS matches or exceeds tuned FedAvg and FedAMS in both convex and non-convex tasks, while requiring less hyperparameter tuning and offering improved generalization.

Abstract

Federated learning is a paradigm of distributed machine learning in which multiple clients coordinate with a central server to learn a model, without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) ensure balance among the clients by using the same stepsize for local updates on all clients. However, this means that all clients need to respect the global geometry of the function, which could yield slow convergence. In this work, we propose locally adaptive federated learning algorithms that leverage the local geometric information for each client function. We show that such locally adaptive methods with uncoordinated stepsizes across all clients can be particularly efficient in interpolated (overparameterized) settings, and analyze their convergence in the presence of heterogeneous data for convex and strongly convex settings. We validate our theoretical claims by performing illustrative experiments for both i.i.d. and non-i.i.d. cases. Our proposed algorithms match the optimization performance of tuned FedAvg in the convex setting, outperform FedAvg as well as state-of-the-art adaptive federated algorithms like FedAMS for non-convex experiments, and come with superior generalization performance.
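
To make "locally adaptive with uncoordinated stepsizes" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the capped Polyak rule $\gamma = \min\{(\text{loss} - \ell^\star)/(c\,\|g\|^2),\ \gamma_b\}$ with lower bound $\ell^\star = 0$, full client participation, and synthetic least-squares clients that share one solution (an interpolating problem). The names `sps_stepsize`, `local_sps_round`, `make_clients`, and `run` are hypothetical and chosen for illustration only.

```python
import numpy as np

def sps_stepsize(loss, grad, c=0.5, gamma_b=1.0, lower_bound=0.0):
    """Capped Polyak-type stepsize: min((loss - lower_bound) / (c * ||grad||^2), gamma_b)."""
    g2 = float(np.dot(grad, grad))
    if g2 == 0.0:
        return 0.0  # zero gradient: the update would not move the iterate anyway
    return min((loss - lower_bound) / (c * g2), gamma_b)

def local_sps_round(x_server, clients, tau=5, c=0.5, gamma_b=1.0, batch=8, rng=None):
    """One round: every client runs tau local SGD steps with its own stepsize, then the server averages."""
    rng = np.random.default_rng(0) if rng is None else rng
    local_models = []
    for A, b in clients:
        x = x_server.copy()
        for _ in range(tau):
            idx = rng.choice(len(b), size=batch, replace=False)
            r = A[idx] @ x - b[idx]           # mini-batch residuals
            loss = 0.5 * np.mean(r ** 2)      # mini-batch least-squares loss
            grad = A[idx].T @ r / batch       # its gradient
            x = x - sps_stepsize(loss, grad, c, gamma_b) * grad
        local_models.append(x)
    return np.mean(local_models, axis=0)

def make_clients(n_clients=4, m=32, d=5, seed=0):
    """Synthetic clients (an assumption of this sketch): same solution, different data scales."""
    rng = np.random.default_rng(seed)
    x_true = rng.normal(size=d)
    clients = []
    for i in range(n_clients):
        A = rng.normal(size=(m, d)) * (1.0 + i)  # larger scale gives larger local curvature
        clients.append((A, A @ x_true))          # noiseless labels: every loss is 0 at x_true
    return clients, x_true

def run(rounds=50):
    """Return (final distance to the solution, initial distance) when starting from zero."""
    clients, x_true = make_clients()
    x = np.zeros_like(x_true)
    rng = np.random.default_rng(1)
    for _ in range(rounds):
        x = local_sps_round(x, clients, rng=rng)
    return float(np.linalg.norm(x - x_true)), float(np.linalg.norm(x_true))
```

Calling `err, norm0 = run()` returns the final and initial distances to the shared solution. No stepsize is tuned per client here: clients with larger data scale simply end up taking smaller steps through the Polyak ratio, while `gamma_b` only caps the step.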

Paper Structure

This paper contains 46 sections, 17 theorems, 74 equations, 9 figures, 1 table, and 3 algorithms.

Key Result

Proposition 1

With $\sigma_f^2$, $\zeta_{\star}^2$, and $\sigma_{\star}^2$ as defined in the paper, we have: (a) $\zeta_{\star}^2 \leq 2 L \sigma_f^2$, and (b) $\sigma_{\star}^2 \leq 2 L \sigma_f^2$.
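
The definitions themselves are not reproduced in this summary, so the following is only a sketch of why part (a) is plausible, under assumed definitions: $\sigma_f^2 = \frac{1}{n}\sum_{i=1}^n \big(f_i(x^\star) - \ell_i^\star\big)$ with $\ell_i^\star \leq \inf_x f_i(x)$, $\zeta_{\star}^2 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|^2$, and each $f_i$ being $L$-smooth. The symbols $n$, $x^\star$, and $\ell_i^\star$ are assumptions of this sketch. The standard descent inequality for $L$-smooth functions, applied at the point $x - \frac{1}{L}\nabla f_i(x)$, gives

$$
\ell_i^\star \;\leq\; f_i\!\Big(x - \tfrac{1}{L}\nabla f_i(x)\Big) \;\leq\; f_i(x) - \tfrac{1}{2L}\,\|\nabla f_i(x)\|^2
\quad\Longrightarrow\quad
\|\nabla f_i(x)\|^2 \;\leq\; 2L\,\big(f_i(x) - \ell_i^\star\big).
$$

Evaluating at $x = x^\star$ and averaging over the clients then yields

$$
\zeta_{\star}^2 \;=\; \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|^2 \;\leq\; \frac{2L}{n}\sum_{i=1}^n \big(f_i(x^\star) - \ell_i^\star\big) \;=\; 2L\,\sigma_f^2 .
$$

Part (b) would follow from the same argument applied to the stochastic functions, together with the fact that a variance is bounded by the corresponding second moment, provided each stochastic function is also $L$-smooth and bounded below by $\ell_i^\star$.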

Figures (9)

  • Figure 1: Illustration for Example 1, showing that local adaptivity can improve convergence. We run SGD with constant, global SPS, and locally adaptive SPS stepsizes (with $c=0.5, \gamma_b=1.0$), for functions $f_1(x) = x^2$, $f_2(x) = \frac{1}{2}x^2$, where stochastic noise was simulated by adding Gaussian noise with mean 0 and standard deviation 10 to the gradients.
  • Figure 2: Sensitivity analysis of FedSPS to hyperparameters for convex logistic regression on the MNIST dataset (i.i.d.) without client sampling. (a) Comparing the effect of varying $\gamma_b$ on FedSPS and varying $\gamma$ on FedAvg convergence: FedAvg is more sensitive to changes in $\gamma$, while FedSPS is insensitive to changes in $\gamma_b$. (b) Effect of varying $\gamma_b$ on FedSPS stepsize adaptivity: adaptivity is lost if $\gamma_b$ is chosen too small. (c) Small $c$ works well in practice ($\tau = 5$). (d) Optimal $c$ versus $\tau$, showing that there is no dependence.
  • Figure 3: Comparison for convex logistic regression. (a) MNIST dataset (i.i.d. without client sampling). (b) w8a dataset (i.i.d. with client sampling). (c) MNIST dataset (non-i.i.d. with client sampling). (d) Average stepsize across all clients for FedSPS and FedDecSPS corresponding to (c). The performance of FedSPS matches that of FedAvg with the best tuned local learning rate in the i.i.d. cases, and exceeds it in the non-i.i.d. case.
  • Figure 4: Non-convex MNIST experiments with client sampling. (a) Non-convex case of LeNet on MNIST dataset (i.i.d.). (b) Non-convex case of LeNet on MNIST dataset (non-i.i.d.). First column represents training loss, second column is test accuracy. Convergence of FedSPS is very close to that of FedAvg with the best possible tuned local learning rate. Moreover, FedSPS converges better than FedAMS for the non-convex MNIST case (both i.i.d. and non-i.i.d.), and also offers better generalization performance than FedAMS. FedSPS is referred to as FedSPS-Local in the legends, to distinguish it clearly from FedSPS-Global.
  • Figure 5: Non-convex CIFAR-10 experiments with client sampling. (a) Non-convex case of ResNet18 on CIFAR-10 dataset (i.i.d.). (b) Non-convex case of ResNet18 on CIFAR-10 dataset (non-i.i.d.). First column represents training loss, second column is test accuracy. FedSPS converges better than FedAvg and FedAMS for both i.i.d. and non-i.i.d. settings, and also offers superior generalization performance.
  • ...and 4 more figures
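
The setting of Figure 1 can be sketched in a few lines. This is a hedged illustration rather than a reproduction of the figure: it assumes the capped Polyak rule with lower bound 0, samples one of the two functions uniformly at each step, and omits the global SPS baseline. The names `sps`, `step`, and `run` are hypothetical.

```python
import random

# The two objectives from Figure 1 as (value, derivative) pairs.
F1 = (lambda x: x * x, lambda x: 2.0 * x)        # f_1(x) = x^2
F2 = (lambda x: 0.5 * x * x, lambda x: x)        # f_2(x) = x^2 / 2

def sps(loss, grad, c=0.5, gamma_b=1.0):
    """Capped Polyak stepsize, assuming 0 as the lower bound of both functions."""
    return gamma_b if grad == 0.0 else min(loss / (c * grad * grad), gamma_b)

def step(x, fn, stepsize=None, noise_std=0.0, rng=random):
    """One SGD step on fn; stepsize=None selects the locally adaptive rule."""
    value, deriv = fn
    g = deriv(x) + rng.gauss(0.0, noise_std)     # noisy gradient
    gamma = sps(value(x), g) if stepsize is None else stepsize
    return x - gamma * g

def run(stepsize=None, x0=3.0, steps=200, noise_std=10.0, seed=0):
    """Noisy SGD that samples f_1 or f_2 uniformly at every step; returns the last iterate."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        x = step(x, rng.choice((F1, F2)), stepsize, noise_std, rng)
    return x
```

Without noise, the adaptive rule picks $\gamma = 0.5$ on $f_1$ and $\gamma = 1.0$ on $f_2$, so `step(3.0, F1)` and `step(3.0, F2)` both land exactly on the minimizer 0. No single constant stepsize does both: `step(3.0, F2, stepsize=0.5)` only reaches 1.5, and `step(3.0, F1, stepsize=1.0)` overshoots to -3.0. `run()` and `run(stepsize=0.5)` give the noisy counterparts with standard deviation 10.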

Theorems & Definitions (28)

  • Example 1: Local adaptivity using Polyak stepsizes can improve convergence
  • Remark 2: Alternative design choices
  • Proposition 1: Comparison of heterogeneity measures
  • Theorem 3: Convergence of FedSPS
  • Remark 4: Minimal need for hyperparameter tuning
  • Corollary 5: Linear Convergence of FedSPS under Interpolation
  • Theorem 6: Convergence of small stepsize FedSPS
  • Definition 1: Convexity
  • Definition 2: $L$-smooth
  • Lemma 7: li2019convergence_orabona, Lemma 4
  • ...and 18 more