A Federated Online Restless Bandit Framework for Cooperative Resource Allocation
Jingwen Tong, Xinran Li, Liqun Fu, Jun Zhang, Khaled B. Letaief
TL;DR
The paper tackles learning unknown dynamics in cooperative resource allocation by formulating a multi-agent online restless MAB problem and proposing a Federated Online RMAB framework. Central to the approach is FedTSWI, which combines federated Thompson sampling for estimating the system dynamics with a Whittle-index policy to guide arm selection, while addressing data privacy and communication overhead. A regret bound of $\mathcal{R}(T)=\mathcal{O}(\sqrt{T \log T})$ is derived, and the framework is validated in an online multi-user multi-channel access case, showing fast convergence and improved sample efficiency as the number of agents grows. The results demonstrate a practical, scalable solution for dynamic spectrum access and similar distributed resource allocation problems with unknown system dynamics.
Abstract
Restless multi-armed bandits (RMABs) have been widely utilized to address resource allocation problems with Markov reward processes (MRPs). Existing works often assume that the dynamics of MRPs are known a priori, which makes the RMAB problem solvable from an optimization perspective. Nevertheless, an efficient learning-based solution for RMABs with unknown system dynamics remains an open problem. In this paper, we study the cooperative resource allocation problem with unknown system dynamics of MRPs. This problem can be modeled as a multi-agent online RMAB problem, where multiple agents collaboratively learn the system dynamics while maximizing their accumulated rewards. We devise a federated online RMAB framework that mitigates communication overhead and data privacy issues by adopting the federated learning paradigm. Based on this framework, we put forth a Federated Thompson Sampling-enabled Whittle Index (FedTSWI) algorithm to solve this multi-agent online RMAB problem. The FedTSWI algorithm enjoys high communication and computation efficiency, as well as a privacy guarantee. Moreover, we derive a regret upper bound for the FedTSWI algorithm. Finally, we demonstrate the effectiveness of the proposed algorithm in the case of online multi-user multi-channel access. Numerical results show that the proposed algorithm achieves a fast convergence rate of $\mathcal{O}(\sqrt{T\log(T)})$ and better performance compared with baselines. More importantly, its sample complexity decreases with the number of agents.
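To make the idea of federated Thompson sampling over unknown Markov dynamics concrete, the following is a minimal sketch and not the authors' FedTSWI algorithm. It assumes a two-state (bad = 0, good = 1) channel, uniform Beta(1, 1) priors, and a server that sums per-agent transition counts; all function names and the parameter values 0.2 and 0.8 are hypothetical. The Whittle-index computation that would consume the sampled parameters is deliberately omitted.

```python
import random

def count_transitions(trajectory):
    """Return counts[s][s_next] for a binary (0/1) state trajectory."""
    counts = [[0, 0], [0, 0]]
    for s, s_next in zip(trajectory, trajectory[1:]):
        counts[s][s_next] += 1
    return counts

def aggregate(local_counts):
    """Server step (assumed): element-wise sum of per-agent count matrices."""
    total = [[0, 0], [0, 0]]
    for c in local_counts:
        for s in range(2):
            for t in range(2):
                total[s][t] += c[s][t]
    return total

def thompson_sample(counts, rng):
    """Draw (p01, p11) from Beta posteriors under uniform Beta(1, 1) priors."""
    p01 = rng.betavariate(1 + counts[0][1], 1 + counts[0][0])
    p11 = rng.betavariate(1 + counts[1][1], 1 + counts[1][0])
    return p01, p11

def simulate(p01, p11, steps, rng):
    """Generate a trajectory of a two-state Markov chain starting in state 0."""
    state, trajectory = 0, [0]
    for _ in range(steps):
        p_good = p11 if state == 1 else p01
        state = 1 if rng.random() < p_good else 0
        trajectory.append(state)
    return trajectory

rng = random.Random(0)
# Five agents each observe 200 transitions of the same (unknown) channel.
local = [count_transitions(simulate(0.2, 0.8, 200, rng)) for _ in range(5)]
# Only the count matrices are shared, not the raw trajectories.
global_counts = aggregate(local)
p01_hat, p11_hat = thompson_sample(global_counts, rng)
```

In this sketch the pooled posterior is built from five times as many transitions as any single agent observes, which gives some intuition for why the amount of data each agent needs could shrink as more agents participate.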
