
Mitigating System Bias in Resource Constrained Asynchronous Federated Learning Systems

Jikun Gao, Ioannis Mavromatis, Peizheng Li, Pietro Carnelli, Aftab Khan

TL;DR

The paper tackles system bias in resource-constrained asynchronous federated learning caused by device heterogeneity and non-IID data. It introduces a buffer-based dynamic aggregation that assigns a scaling factor to each incoming update using a staleness factor and the client's upload frequency, with the key formulas $\beta_n = \frac{|D_n| \cdot e^{s_n \cdot \frac{1}{f_n}}}{\sum_{b=1}^{B} |D_b| \cdot e^{s_b \cdot \frac{1}{f_b}}}$ and $s = (t - \tau + 1)^{-\alpha}$. The global model is updated after aggregating the scaled updates, and the latest global model is immediately provided to clients after each upload to minimize idling. Experiments on Fashion-MNIST with 10 simulated clients show over 10% and 19% higher global model accuracy than PAPAYA and FedAsync, respectively, underscoring improved robustness and scalability for real-world AFL deployments.
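
The two formulas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes that $t$ is the current global model version, $\tau$ is the version the client trained from, $f_n$ is a positive upload count, and $B$ is the number of updates in the buffer.

```python
import math


def staleness(t, tau, alpha):
    """Staleness factor s = (t - tau + 1) ** (-alpha).

    Assumption: t is the current global model version and tau is the
    version the client's update was trained from, so t >= tau.
    """
    return (t - tau + 1) ** (-alpha)


def scaling_factors(data_sizes, staleness_values, upload_freqs):
    """Scaling factors beta_n for the B updates currently buffered.

    Each raw score is |D_n| * exp(s_n / f_n); dividing by the sum of the
    raw scores makes the factors add up to 1.
    """
    raw = [
        d * math.exp(s / f)
        for d, s, f in zip(data_sizes, staleness_values, upload_freqs)
    ]
    total = sum(raw)
    return [r / total for r in raw]
```

Under these assumptions, a client that uploads more often (larger $f_n$) receives a smaller exponent and therefore a smaller share of the aggregate, which is how the formula counteracts the dominance of fast devices.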

Abstract

Federated learning (FL) systems face performance challenges in dealing with heterogeneous devices and non-identically distributed data across clients. We propose a dynamic global model aggregation method within Asynchronous Federated Learning (AFL) deployments to address these issues. Our aggregation method scores and adjusts the weighting of client model updates based on their upload frequency to accommodate differences in device capabilities. We also immediately provide an updated global model to clients after they upload their local models, which reduces idle time and improves training efficiency. We evaluate our approach within an AFL deployment consisting of 10 simulated clients with heterogeneous compute constraints and non-IID data. The simulation results, using the FashionMNIST dataset, demonstrate over 10% and 19% improvement in global model accuracy compared to state-of-the-art methods PAPAYA and FedAsync, respectively. Our dynamic aggregation method allows reliable global model training despite limited client resources and statistical data heterogeneity. This improves robustness and scalability for real-world FL deployments.

Paper Structure (6 sections, 1 equation, 5 figures, 3 tables, 1 algorithm)


Figures (5)

  • Figure 1: A typical AFL system overview. Starting from the left, an initialised model is sent to all clients. Clients commence training (green), and once completed, they share their model with the parameter server. Once a new client model is uploaded, it is immediately aggregated into the global model (yellow diamonds) and sent back to the client.
  • Figure 2: Our asynchronous FL system overview diagram. Clients train using local datasets before sharing model parameters with the Parameter Server. In turn, client models are scored/weighted according to the frequency of updates before being aggregated into a new global model for sharing with client devices.
  • Figure 3: Diagram of our proposed method with an aggregation goal of 3 client models. Clients immediately receive the latest aggregation/global model once they upload their local model to the PS. However, we store and score the incoming client models in the PS buffer layer to reduce biases when aggregated. If the new global model is sufficiently different (once the buffer is full) it is then re-shared with all the clients.
  • Figure 4: Results for IID versus various non-IID client training dataset distribution settings, with client computing resources set to the following units/fractions $[100, 95, 90, 85, 80, 25, 20, 15, 10, 5]$, where larger values imply better computing resources. The hyperparameter settings used are shown in Table \ref{tab:hyperparameters}. Exp 1.1: IID client training data; Exp 1.2: resource-constrained devices with 3 training classes; Exp 1.3: resource-constrained devices with all training classes; Exp 1.4: client training data with a Dirichlet distribution.
  • Figure 5: Comparison between our proposed method, Meta's PAPAYA \cite{huba2022papaya}, and FedAsync \cite{xie2019asynchronous} AFL systems, illustrating classification accuracy on the FashionMNIST dataset under experimental setup 1.4 as described in Table \ref{table:experiment3}, with the hyperparameter settings shown in Table \ref{tab:hyperparameters}.
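
The buffer mechanism described in the Figure 3 caption can be sketched as a small parameter-server class. This is a hedged illustration, not the paper's algorithm: the class and method names are hypothetical, models are plain lists of floats, the new global model is assumed to be the scaled average of the buffered client models, upload frequency is assumed to be a per-client upload count, and the "sufficiently different" check before re-sharing is omitted.

```python
import math


class BufferedServer:
    """Toy parameter server with a buffer layer (all names hypothetical)."""

    def __init__(self, init_model, buffer_size=3, alpha=0.5):
        self.model = list(init_model)
        self.version = 0                # global model version t
        self.buffer_size = buffer_size  # aggregation goal (3 in Figure 3)
        self.alpha = alpha
        self.buffer = []                # (weights, data_size, staleness, freq)
        self.upload_counts = {}         # client_id -> uploads so far

    def upload(self, client_id, weights, data_size, base_version):
        """Store a client model and immediately return the latest global model."""
        count = self.upload_counts.get(client_id, 0) + 1
        self.upload_counts[client_id] = count
        s = (self.version - base_version + 1) ** (-self.alpha)
        self.buffer.append((list(weights), data_size, s, count))
        if len(self.buffer) >= self.buffer_size:
            self._aggregate()
        return list(self.model), self.version

    def _aggregate(self):
        """Assumed rule: new global model = sum of beta_n * client weights."""
        raw = [d * math.exp(s / f) for _, d, s, f in self.buffer]
        total = sum(raw)
        new_model = [0.0] * len(self.model)
        for (weights, _, _, _), r in zip(self.buffer, raw):
            for i, w in enumerate(weights):
                new_model[i] += (r / total) * w
        self.model = new_model
        self.version += 1
        self.buffer.clear()
```

In this sketch the first two uploads return the unchanged initial model, and the third upload fills the buffer, triggers aggregation, and returns the new global model to the uploading client.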