AdaFed: Fair Federated Learning via Adaptive Common Descent Direction

Shayan Mohajer Hamidi, En-Hui Yang

TL;DR

AdaFed seeks an update direction for the server along which all the clients' loss functions decrease and, more importantly, along which the loss functions of clients with larger loss values decrease at a higher rate.

Abstract

Federated learning (FL) is a promising technology in which edge devices (clients) collaboratively train a machine learning model under the orchestration of a server. Learning an unfair model is a known critical problem in FL: the trained model may unfairly advantage or disadvantage some of the devices. To tackle this problem, we propose AdaFed. The goal of AdaFed is to find an update direction for the server along which (i) all the clients' loss functions decrease; and (ii) more importantly, the loss functions of clients with larger loss values decrease at a higher rate. AdaFed adaptively tunes this common direction based on the values of the local gradients and loss functions. We validate the effectiveness of AdaFed on a suite of federated datasets and demonstrate that it outperforms state-of-the-art fair FL methods.
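To make property (i) concrete, the following is a minimal sketch, not the authors' implementation. It assumes the server applies the update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\, \boldsymbol{d}$, so that to first order client $k$'s loss decreases at rate $\langle \boldsymbol{g}_k, \boldsymbol{d} \rangle$, where $\boldsymbol{g}_k$ is its local gradient. The function names and toy gradients are hypothetical.

```python
import numpy as np

def decrease_rates(client_grads, direction):
    """First-order decrease rate of each client's loss for the update
    theta <- theta - eta * direction, i.e. the inner products <g_k, d>."""
    grads = np.asarray(client_grads, dtype=float)   # shape (K, dim)
    d = np.asarray(direction, dtype=float)          # shape (dim,)
    return grads @ d

def is_common_descent(client_grads, direction):
    """True when every client's loss decreases to first order along -direction."""
    return bool(np.all(decrease_rates(client_grads, direction) > 0.0))

# Toy example with two clients (all values are made up for illustration).
aligned = np.array([[1.0, 0.0], [0.0, 2.0]])
avg = aligned.mean(axis=0)                 # plain averaging of the gradients
rates_aligned = decrease_rates(aligned, avg)

conflicting = np.array([[1.0, 0.0], [-2.0, 0.1]])
avg_conflict = conflicting.mean(axis=0)    # averaged direction under conflict
rates_conflict = decrease_rates(conflicting, avg_conflict)
```

In the second toy case the averaged gradient has a negative inner product with the first client's gradient, so that client's loss increases to first order. This is the kind of situation a common descent direction is meant to rule out. Property (ii) could be inspected in the same way by comparing the entries of `decrease_rates` against the clients' current loss values.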

Paper Structure

This paper contains 57 sections, 10 theorems, 45 equations, 7 figures, 19 tables, 1 algorithm.

Key Result

Lemma 3.2

[mukai1980algorithms] Any Pareto-optimal solution is Pareto-stationary. Conversely, if all the functions $\{f_k(\boldsymbol{\theta})\}_{k \in [K]}$ are convex, then any Pareto-stationary solution is weakly Pareto optimal. Here, $\boldsymbol{\theta}^*$ is called a weakly Pareto-optimal solution of eq:minpareto if ...

Figures (7)

  • Figure 1: (a) The Pareto front for two objective functions $f_1(\boldsymbol{\theta})$ and $f_2(\boldsymbol{\theta})$. MGDA may converge to any point on the Pareto front. (b)-(c) Illustration of the convex hull $\mathcal{G}$ and the minimal-norm vector $\boldsymbol{\mathfrak{d}}(\mathcal{G})$ for two gradient vectors $\mathfrak{g}_1$ and $\mathfrak{g}_2$. In (b), $\| \mathfrak{g}_1\|_2^2 < \|\mathfrak{g}_2 \|_2^2$, so the direction of $\boldsymbol{\mathfrak{d}}(\mathcal{G})$ is more inclined toward $\mathfrak{g}_1$. In (c), $\| \mathfrak{g}_1\|_2^2 = \|\mathfrak{g}_2 \|_2^2=1$, so the direction of $\boldsymbol{\mathfrak{d}}(\mathcal{G})$ coincides with the bisector of $\mathfrak{g}_1$ and $\mathfrak{g}_2$.
  • Figure 2: The percentage of improved clients as a function of communication rounds for (a) CIFAR-10 setup one in \ref{sec:CIFAR-10}; and (b) CIFAR-100 setup one in \ref{sec:cifar100}.
  • Figure 3: The training loss of two clients trained in the AdaFed framework vs. the communication rounds for (a) CIFAR-10 setup one in \ref{sec:CIFAR-10}; and (b) CIFAR-100 setup one in \ref{sec:cifar100}.
  • Figure 4: The convergence of $\| \boldsymbol{\mathfrak{d}}_t\|$ as a function of communication rounds for (a) $e=1$ and local SGD, (b) $e>1$ and local GD, and (c) $e=1$ and local GD. The dataset is CIFAR-10.
  • Figure 5: Average test accuracy across clients for different FL methods on CIFAR-10. The setup for the experiments is elaborated in \ref{sec:CIFAR-10}, setup 1.
  • ...and 2 more figures
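
The geometry described in the Figure 1(b)-(c) caption can be reproduced numerically. The sketch below uses the standard closed form for the minimal-norm point in the convex hull of two vectors. It illustrates only that two-vector case and is not the AdaFed algorithm itself. The function name and the example vectors are hypothetical.

```python
import numpy as np

def min_norm_two(g1, g2):
    """Minimal-norm point of the convex hull {lam*g1 + (1-lam)*g2 : 0 <= lam <= 1}.

    Setting the derivative of ||g2 + lam*(g1 - g2)||^2 to zero gives
    lam = <g2 - g1, g2> / ||g1 - g2||^2, which is then clipped to [0, 1].
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:            # identical vectors: the hull is a single point
        return g1.copy(), 1.0
    lam = float(np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0))
    return lam * g1 + (1.0 - lam) * g2, lam

# Unequal norms, as in panel (b): the result leans toward the shorter vector g1.
d_b, lam_b = min_norm_two([1.0, 0.0], [0.0, 2.0])

# Equal unit norms, as in panel (c): the result lies on the bisector.
d_c, lam_c = min_norm_two([1.0, 0.0], [0.0, 1.0])
```

For the unequal-norm pair the weight on $\mathfrak{g}_1$ is 0.8 and the resulting vector is $(0.8, 0.4)$, which makes a smaller angle with $\mathfrak{g}_1$ than with $\mathfrak{g}_2$. For the unit-norm pair the weight is 0.5 and the vector is $(0.5, 0.5)$. In both cases the inner product of the result with each gradient equals its squared norm, so stepping against it decreases both objectives to first order.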

Theorems & Definitions (19)

  • Definition 3.1
  • Lemma 3.2
  • Theorem 4.1
  • proof
  • Remark 4.2
  • Theorem 5.1
  • proof
  • Remark 5.2
  • Theorem 5.3: $e=1$ & local SGD
  • Theorem 5.4: $e>1$ & local GD
  • ...and 9 more