Load Balancing in Federated Learning
Alireza Javani, Zhiying Wang
TL;DR
This work tackles load balancing in Federated Learning under partial participation. It introduces a load metric $X$ tied to the Age of Information (AoI) and aims to minimize $\operatorname{Var}[X]$ under the constraint $P(S_i^{(t)}=1)=\frac{k}{n}$. It advocates a decentralized Markov scheduling policy in which each client's AoI governs its participation, derives the optimal transition probabilities, and shows that the policy is equivalent to oldest-age selection in many regimes. In simulations on MNIST, CIFAR-10, and CIFAR-100 with $n=100$ and $k=15$, the Markov policy converges faster than random selection (e.g., CIFAR-10: 240 vs 265 rounds; CIFAR-100: 500 vs 600+ rounds) and improves fairness by reducing update staleness. The results highlight the practical benefits of minimizing $\operatorname{Var}[X]$ for robustness to data heterogeneity and dynamic network conditions.
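As a point of reference for the random-selection baseline, the following worked calculation assumes that $X$ counts the rounds between two consecutive participations of the same client (the summary above does not spell out the definition, so this is an assumption). If $k$ of $n$ clients are drawn uniformly at random in every round, a given client is selected independently across rounds with probability $p=\frac{k}{n}$, so $X$ is geometric:

$$
\mathbb{E}[X]=\frac{1}{p}=\frac{n}{k},\qquad
\operatorname{Var}[X]=\frac{1-p}{p^{2}}=\frac{n(n-k)}{k^{2}}.
$$

For $n=100$ and $k=15$ this gives $\mathbb{E}[X]\approx 6.67$ and $\operatorname{Var}[X]=\frac{8500}{225}\approx 37.8$. Any policy that keeps the same participation probability $\frac{k}{n}$ has the same mean under this assumption, so the variance is the quantity left to reduce.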
Abstract
Federated Learning (FL) is a decentralized machine learning framework that enables learning from data distributed across multiple remote devices, enhancing communication efficiency and data privacy. Because communication resources are limited, a scheduling policy is often applied to select a subset of devices to participate in each FL round. Scheduling is challenging because it must provide fair workload distribution, efficient resource utilization, and scalability in environments with numerous edge devices, all while the data are statistically heterogeneous across devices. This paper proposes a load metric for scheduling policies based on the Age of Information and addresses the above challenges by minimizing the variance of the load metric across clients. Furthermore, a decentralized Markov scheduling policy is presented that ensures a balanced workload distribution. Because each client decides independently whether to participate, the policy eliminates management overhead regardless of network size. We establish the optimal parameters of the Markov chain model and validate our approach through simulations. The results demonstrate that reducing the load metric variance not only promotes fairness and improves operational efficiency, but also enhances the convergence rate of the learning models.
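To make the idea of load-metric variance concrete, the following minimal sketch compares two centralized baselines: uniform random selection and oldest-age selection. It is not the paper's Markov policy, whose transition probabilities are not given here. The sketch assumes the load metric is the number of rounds between consecutive selections of a client; the function name `simulate_gaps` and its parameters are hypothetical and chosen for illustration only.

```python
import random
from statistics import mean, pvariance


def simulate_gaps(n, k, rounds, policy, seed=0):
    """Return observed gaps (in rounds) between consecutive selections.

    Assumption: the load metric is the gap between two consecutive
    selections of the same client. `policy` is "random" or "oldest".
    """
    rng = random.Random(seed)
    age = [0] * n        # rounds since each client was last selected
    seen = [False] * n   # whether the client has been selected before
    gaps = []
    for _ in range(rounds):
        age = [a + 1 for a in age]
        if policy == "random":
            # k distinct clients drawn uniformly at random
            chosen = rng.sample(range(n), k)
        elif policy == "oldest":
            # k clients with the largest age; ties go to the lower index
            chosen = sorted(range(n), key=lambda i: -age[i])[:k]
        else:
            raise ValueError(f"unknown policy: {policy}")
        for i in chosen:
            if seen[i]:
                gaps.append(age[i])  # the first selection is warm-up, not a gap
            seen[i] = True
            age[i] = 0
    return gaps


if __name__ == "__main__":
    for policy in ("random", "oldest"):
        gaps = simulate_gaps(n=100, k=15, rounds=2000, policy=policy)
        print(policy, "mean gap:", round(mean(gaps), 2),
              "variance:", round(pvariance(gaps), 2))
```

Both policies select each client about once every $n/k$ rounds on average, but they spread those selections very differently. Under random selection the gap is geometric, with variance $n(n-k)/k^2 \approx 37.8$ for $n=100$, $k=15$; under oldest-age selection every gap is either 6 or 7 rounds for these parameters, so the variance is far smaller.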
