Towards a Pretrained Model for Restless Bandits via Multi-arm Generalization
Yunfan Zhao, Nikhil Behari, Edward Hughes, Edwin Zhang, Dheeraj Nagaraj, Karl Tuyls, Aparna Taneja, Milind Tambe
TL;DR
Restless multi-arm bandits (RMABs) are difficult to deploy because of unknown dynamics, streaming arm participation, and continuous state spaces. The paper introduces PreFeRMAB, a pretrained, flexible RMAB model that generalizes to unseen arms through multi-arm generalization, supported by a novel $\lambda$-network updating rule and a StateShaping module. It demonstrates strong zero-shot performance across synthetic, SIS epidemic, and ARMMAN maternal-health domains, and it can be fine-tuned with significantly fewer samples than training from scratch requires. By reducing retraining requirements and accommodating dynamic arm participation, the approach makes deployment in real-world resource allocation problems more practical.
Abstract
Restless multi-arm bandits (RMABs), a class of resource allocation problems with broad application in areas such as healthcare, online advertising, and anti-poaching, have recently been studied from a multi-agent reinforcement learning perspective. Prior RMAB research suffers from several limitations: for example, it fails to adequately address continuous states and requires retraining from scratch when arms opt in and opt out over time, a common challenge in many real-world applications. We address these limitations by developing a neural network-based pretrained model (PreFeRMAB) that has general zero-shot ability on a wide range of previously unseen RMABs and that can be fine-tuned on specific instances in a more sample-efficient way than retraining from scratch. Our model also accommodates general multi-action settings and discrete or continuous state spaces. To enable fast generalization, we learn a novel single policy network model that utilizes feature information and employs a training procedure in which arms opt in and out over time. We derive a new update rule for a crucial $\lambda$-network with theoretical convergence guarantees and empirically demonstrate the advantages of our approach on several challenging, real-world-inspired problems.
