
Adaptive Federated Learning with Auto-Tuned Clients

Junhyung Lyle Kim, Mohammad Taha Toghani, César A. Uribe, Anastasios Kyrillidis

TL;DR

This paper tackles the challenge of tuning client step sizes in federated learning (FL) under heterogeneous data and participation. It introduces $\Delta$-SGD, a locality-adaptive variant of SGD in which each client uses a per-iteration step size that depends on the local smoothness of its own objective and can increase over local steps, with the analysis extended to the FL setting. The authors provide a nonconvex convergence bound that depends on the local smoothness $\tilde{L}$ and the gradient noise, along with a convex Lyapunov-based guarantee, and demonstrate empirical robustness across multiple datasets, models, and heterogeneity levels without per-task tuning. The method is complementary to server-side optimizers such as FedAdam and remains effective when combined with proximal (FedProx) or model-contrastive (MOON) client objectives, offering a practical solution to client-side tuning in FL.

Abstract

Federated learning (FL) is a distributed machine learning framework where the global model of a central server is trained via multiple collaborative steps by participating clients without sharing their data. While being a flexible framework, where the distribution of local data, participation rate, and computing power of each client can greatly vary, such flexibility gives rise to many new challenges, especially in the hyperparameter tuning on the client side. We propose $\Delta$-SGD, a simple step size rule for SGD that enables each client to use its own step size by adapting to the local smoothness of the function each client is optimizing. We provide theoretical and empirical results where the benefit of the client adaptivity is shown in various FL scenarios.
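To make the idea of a step size that adapts to local smoothness concrete, here is a minimal sketch in Python. It is not taken from the text above: the exact rule is an assumption, namely the minimum of a local-smoothness term, $\gamma\|x_t-x_{t-1}\| / (2\|g_t-g_{t-1}\|)$, and a growth cap, $\sqrt{1+\delta\theta_{t-1}}\,\eta_{t-1}$, with $\theta_t=\eta_t/\eta_{t-1}$. The function names, the default values of `gamma`, `delta`, and `eta0`, and the use of full (non-stochastic) gradients on a toy quadratic are all illustrative choices.

```python
import numpy as np

def adaptive_step_size(x, x_prev, g, g_prev, eta_prev, theta_prev,
                       gamma=1.0, delta=0.1):
    """Assumed Delta-SGD-style rule: min(smoothness term, growth cap)."""
    dx = np.linalg.norm(x - x_prev)
    dg = np.linalg.norm(g - g_prev)
    # ||dg|| / ||dx|| estimates the local smoothness; its inverse bounds the step.
    smooth_cap = np.inf if dg == 0.0 else gamma * dx / (2.0 * dg)
    # The growth cap limits how quickly the step size may increase.
    growth_cap = np.sqrt(1.0 + delta * theta_prev) * eta_prev
    eta = min(smooth_cap, growth_cap)
    return eta, eta / eta_prev

def run_local_steps(grad_fn, x0, num_steps, eta0=0.01, theta0=1.0,
                    gamma=1.0, delta=0.1):
    """Run local gradient steps on one client with the adaptive step size."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad_fn(x_prev)
    eta, theta = eta0, theta0
    x = x_prev - eta * g_prev  # the first step uses the initial step size
    etas = [eta]
    for _ in range(num_steps - 1):
        g = grad_fn(x)
        eta, theta = adaptive_step_size(x, x_prev, g, g_prev, eta, theta,
                                        gamma, delta)
        x_prev, g_prev = x, g
        x = x - eta * g
        etas.append(eta)
    return x, etas

# Toy client objective (hypothetical): f(x) = 0.5 * sum(a * x**2).
a = np.array([1.0, 10.0])
loss = lambda x: 0.5 * np.sum(a * x * x)
grad = lambda x: a * x

x_start = np.array([1.0, 1.0])
x_final, etas = run_local_steps(grad, x_start, num_steps=200)
print("initial loss:", loss(x_start), "final loss:", loss(x_final))
print("first step sizes:", etas[:3])
```

In this sketch no step size is hand-tuned beyond a small initial value: the growth cap lets the step size increase while the trajectory is well behaved, and the smoothness term pulls it back when the gradient changes quickly relative to the iterate.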

Paper Structure (21 sections, 5 theorems, 53 equations, 8 figures, 7 tables, 1 algorithm)


Key Result

Theorem 1

Let the paper's assumptions hold, with $\rho=\mathcal{O}(1)$. Further, suppose that $\gamma=\mathcal{O}\left(\frac{1}{K\sqrt{T}}\right)$ and $\eta_0 = \mathcal{O}(\gamma)$. Then, the following property holds for Algorithm 1, for $T$ sufficiently large: [...] where $\Psi_1=\max\left\{\frac{\sigma^2}{b},\, f(x_0)-f(x^\star)\right\}$ and $\Psi_2=\left(\frac{\sigma^2}{b}+G^2\right)$ are global constants [...]

Figures (8)

  • Figure 1: Illustration of the effect of not properly tuning the client step sizes. In (A), each client optimizer uses the best step size from grid-search. Then, the same step size from (A) is intentionally used in settings (B) and (C). Only $\Delta$-SGD works well across all settings without additional tuning.
  • Figure 2: The effect of stronger heterogeneity on different client optimizers, induced by the Dirichlet concentration parameter $\alpha \in \{0.01, 0.1, 1\}$. $\Delta$-SGD retains robust performance in all cases, whereas other methods show significant performance degradation when the level of heterogeneity $\alpha$ changes, or when the setting (model/architecture) changes.
  • Figure 3: The effect of changing the dataset and the model architecture on different client optimizers. $\Delta$-SGD retains superior performance without additional tuning when the model or dataset changes, whereas the performance of other methods often degrades. (A): CIFAR-100 trained with ResNet-18 versus ResNet-50 ($\alpha=0.1$), (B): MNIST versus FMNIST trained with CNN ($\alpha=0.01$), (C): CIFAR-10 versus CIFAR-100 trained with ResNet-18 ($\alpha=0.01$).
  • Figure 4: Effect of using different $\delta$ in the second condition of the step size of $\Delta$-SGD.
  • Figure 5: Effect of the different number of local epochs.
  • ...and 3 more figures
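
The captions above describe heterogeneity controlled by a Dirichlet concentration parameter $\alpha$. A common way to construct such a split, sketched below, is to draw for each class a Dirichlet-distributed vector of client proportions and divide that class's samples accordingly. This is an illustrative construction, not necessarily the paper's exact partitioning procedure; the function name `dirichlet_partition` and the toy label array are hypothetical.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions.

    Smaller alpha concentrates each class on fewer clients (more heterogeneity).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Share of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert cumulative shares into split points within idx.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Toy data (hypothetical): 10 classes with 100 samples each, 5 clients.
toy_labels = np.repeat(np.arange(10), 100)
for alpha in (0.01, 0.1, 1.0):
    parts = dirichlet_partition(toy_labels, num_clients=5, alpha=alpha)
    classes_per_client = [len(set(toy_labels[p].tolist())) for p in parts]
    print(f"alpha={alpha}: classes held by each client = {classes_per_client}")
```

With small $\alpha$ most clients typically end up holding only a few classes, while larger $\alpha$ spreads every class more evenly across clients.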

Theorems & Definitions (10)

  • Theorem 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • proof
  • Lemma 4
  • proof
  • Theorem 5
  • proof