CAFe: Cost and Age aware Federated Learning

Sahan Liyanaarachchi, Kanchana Thilakarathna, Sennur Ulukus

TL;DR

The paper tackles resource wastage and communication costs in federated learning with heterogeneous client resources by introducing the Age of Clients (AoC) as a convergence-relevant metric. It analyzes a minimal-learner MCU scheme with a reporting deadline, derives closed-form expressions for resource wastage, communication cost, and AoC, and proves that setting $M=1$ often optimizes these metrics while also linking AoC to convergence bounds. To address heterogeneity, two schemes—Age Weighted Update (AWU) and Aggregated Gradient Update (AGU)—are proposed, and their effectiveness is demonstrated via MNIST experiments under IID and non-IID data distributions. The work provides a principled framework for selecting $M$ and $T$ to balance efficiency and convergence, and offers practical enhancements for robustness against biased or adversarial clients in heterogeneous FL environments.

Abstract

In many federated learning (FL) models, a common strategy to ensure progress in training is to wait, once the parameter server (PS) has broadcast the global model, for at least $M$ of the $N$ clients to send back their local gradients within a reporting deadline $T$. If too few clients report back within the deadline, the round is considered failed and is restarted from scratch. If enough clients respond, the round is deemed successful and the local gradients of all clients that responded are used to update the global model. In either case, the clients that failed to report an update within the deadline have wasted their computational resources. A tighter deadline (small $T$) combined with waiting for more participating clients (large $M$) leads to many failed rounds and therefore to greater communication cost and wasted computation. However, a larger $T$ leads to longer rounds, whereas a smaller $M$ may lead to noisy gradients. There is therefore a need to choose $M$ and $T$ so that communication cost and resource wastage are minimized while maintaining an acceptable convergence rate. In this regard, we show that the average age of a client at the PS appears explicitly in the theoretical convergence bound and can therefore be used as a metric to quantify the convergence of the global model. We provide an analytical scheme to select the parameters $M$ and $T$ in this setting.
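The round protocol described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions: the abstract does not specify a timing model, so the i.i.d. exponential completion times, the `rate` parameter, and the function names `run_round` and `failure_rate` are all hypothetical choices made for illustration.

```python
import random


def run_round(n_clients, m_required, deadline, rate, rng):
    """Simulate one training round; return (succeeded, reported, late)."""
    # Assumption (not from the paper): each client's completion time is an
    # independent exponential random variable with the given rate.
    finish_times = [rng.expovariate(rate) for _ in range(n_clients)]
    # Clients that finish within the reporting deadline T report back.
    reported = sum(1 for t in finish_times if t <= deadline)
    # The round succeeds only if at least M clients reported in time.
    succeeded = reported >= m_required
    # Late clients have wasted their computation whether or not the
    # round succeeds.
    late = n_clients - reported
    return succeeded, reported, late


def failure_rate(n_clients, m_required, deadline, rate=1.0, rounds=2000, seed=0):
    """Estimate the fraction of failed rounds by Monte Carlo simulation."""
    rng = random.Random(seed)
    failures = sum(
        1
        for _ in range(rounds)
        if not run_round(n_clients, m_required, deadline, rate, rng)[0]
    )
    return failures / rounds


if __name__ == "__main__":
    # Small T together with large M should produce many failed rounds.
    for m in (1, 10, 20):
        print(m, failure_rate(n_clients=50, m_required=m, deadline=0.5))
```

Sweeping `m_required` and `deadline` in this way shows the trade-off the abstract describes: a tighter deadline or a larger threshold raises the share of failed rounds, and every client counted in `late` represents wasted computation.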

Paper Structure

This paper contains 18 sections, 6 theorems, 32 equations, 8 figures, and 2 tables.

Key Result

Theorem 1

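Under the given federated learning (FL) model, the average resource wastage is given by a closed-form expression in terms of $p_n$ and $q$, where $p_n = \binom{N}{n}p^n(1-p)^{N-n}$ is the binomial probability of exactly $n$ successes among the $N$ clients and $q=\sum_{n=0}^{M-1}p_n$ is the probability of fewer than $M$ successes.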
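The two quantities in Theorem 1 are straightforward to evaluate numerically. The sketch below assumes that $p$ is the probability that a single client reports within the deadline, independently of the others; that reading is not stated in this summary. The helper names `report_pmf` and `failure_probability` are hypothetical.

```python
from math import comb


def report_pmf(n_clients, p):
    """Return [p_0, ..., p_N], where p_n = C(N, n) p^n (1 - p)^(N - n)."""
    return [
        comb(n_clients, n) * p**n * (1 - p) ** (n_clients - n)
        for n in range(n_clients + 1)
    ]


def failure_probability(n_clients, m_required, p):
    """Return q = sum_{n=0}^{M-1} p_n, the chance of fewer than M successes."""
    return sum(report_pmf(n_clients, p)[:m_required])


if __name__ == "__main__":
    # N = 4, M = 2, p = 0.5: q = p_0 + p_1 = 1/16 + 4/16
    print(failure_probability(4, 2, 0.5))  # → 0.3125
```

Under the assumed interpretation, $q$ is the probability that a round fails, so it grows with $M$ for fixed $N$ and $p$.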

Figures (8)

  • Figure 1: Federated learning (FL) model.
  • Figure 2: The evolution of the instantaneous age of the $k$th client with time.
  • Figure 3: $J(x)$ for $N=50$, $\lambda =1$, $\alpha_w = 20$ and $\alpha_b=100$.
  • Figure 4: Variation of accuracy with $M$ for non i.i.d. MNIST dataset after 1000 training rounds.
  • Figure 5: Variation of the accuracy with the number of training rounds for different values of $M$ for i.i.d. MNIST dataset and $T=0.5$.
  • ...and 3 more figures

Theorems & Definitions (11)

  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Remark 1
  • Remark 2
  • Definition 1
  • Remark 3
  • Theorem 3
  • Corollary 3
  • ...and 1 more