Mitigating System Bias in Resource Constrained Asynchronous Federated Learning Systems
Jikun Gao, Ioannis Mavromatis, Peizheng Li, Pietro Carnelli, Aftab Khan
TL;DR
The paper tackles system bias in resource-constrained asynchronous federated learning caused by device heterogeneity and non-IID data. It introduces a buffer-based dynamic aggregation that assigns a scaling factor to incoming updates using a staleness factor and the upload frequency, with the key formula $\beta_n = \frac{|D_n| \cdot e^{s_n \cdot \frac{1}{f_n}}}{\sum_{b=1}^{B} |D_b| \cdot e^{s_b \cdot \frac{1}{f_b}}}$ and $s = (t - \tau + 1)^{-\alpha}$. The global model is updated after aggregating the scaled updates, and the latest global model is immediately provided to clients after each upload to minimize idling. Experiments on Fashion-MNIST with 10 simulated clients demonstrate substantial gains over PAPAYA and FedAsync, underscoring improved robustness and scalability for real-world AFL deployments.
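To make the two formulas concrete, the following is a minimal Python sketch of how the staleness factor and the normalized scaling factors could be computed for one buffer of updates. The function names, the argument layout, and the reading of $t$ as the current global round and $\tau$ as the round in which the client received its base model are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math


def staleness(t, tau, alpha):
    """Staleness factor s = (t - tau + 1)^(-alpha).

    Assumption: t is the current global round and tau is the round in
    which the client's base model was issued, so t >= tau.
    """
    return (t - tau + 1) ** (-alpha)


def scaling_factors(sizes, stalenesses, freqs):
    """Normalized weights beta_n for the B updates held in the buffer.

    sizes[n]       -> |D_n|, local dataset size of client n
    stalenesses[n] -> s_n, staleness factor of client n's update
    freqs[n]       -> f_n, upload frequency of client n (must be > 0)
    """
    raw = [d * math.exp(s / f) for d, s, f in zip(sizes, stalenesses, freqs)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical buffer of three updates with equal dataset sizes.
s_values = [staleness(t=8, tau=tau, alpha=0.5) for tau in (8, 5, 8)]
betas = scaling_factors(sizes=[100, 100, 100],
                        stalenesses=s_values,
                        freqs=[1.0, 1.0, 4.0])
```

Under these assumptions the weights always sum to one. With equal dataset sizes, a fresh update from a client that uploads rarely (small $f_n$) receives a larger weight than an equally fresh update from a client that uploads often, which is consistent with the stated goal of counteracting bias toward faster devices.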
Abstract
Federated learning (FL) systems face performance challenges in dealing with heterogeneous devices and non-identically distributed data across clients. We propose a dynamic global model aggregation method within Asynchronous Federated Learning (AFL) deployments to address these issues. Our aggregation method scores and adjusts the weighting of client model updates based on their upload frequency to accommodate differences in device capabilities. We also provide an updated global model to clients immediately after they upload their local models, which reduces idle time and improves training efficiency. We evaluate our approach within an AFL deployment consisting of 10 simulated clients with heterogeneous compute constraints and non-IID data. The simulation results, using the FashionMNIST dataset, demonstrate over 10% and 19% improvement in global model accuracy compared to the state-of-the-art methods PAPAYA and FedAsync, respectively. Our dynamic aggregation method allows reliable global model training despite limited client resources and statistical data heterogeneity. This improves robustness and scalability for real-world FL deployments.
