Towards a Pretrained Model for Restless Bandits via Multi-arm Generalization

Yunfan Zhao; Nikhil Behari; Edward Hughes; Edwin Zhang; Dheeraj Nagaraj; Karl Tuyls; Aparna Taneja; Milind Tambe

TL;DR

Restless multi-arm bandits (RMABs) pose challenges due to unknown dynamics, streaming arm participation, and continuous state spaces. The paper introduces PreFeRMAB, a pretrained, flexible RMAB model that generalizes to unseen arms via multi-arm generalization, a novel $\lambda$-network updating rule, and a StateShaping module. It demonstrates strong zero-shot performance across synthetic, SIS epidemic, and ARMMAN maternal-health domains and shows faster fine-tuning with significantly fewer samples than training from scratch. This approach enables practical deployment in real-world resource allocation problems by reducing retraining requirements and accommodating dynamic arm participation.

Abstract

Restless multi-arm bandits (RMABs), a class of resource allocation problems with broad application in areas such as healthcare, online advertising, and anti-poaching, have recently been studied from a multi-agent reinforcement learning perspective. Prior RMAB research suffers from several limitations: for example, it fails to adequately address continuous states, and it requires retraining from scratch when arms opt in and opt out over time, a common challenge in many real-world applications. We address these limitations by developing a neural network-based pretrained model (PreFeRMAB) that has general zero-shot ability on a wide range of previously unseen RMABs, and which can be fine-tuned on specific instances in a more sample-efficient way than retraining from scratch. Our model also accommodates general multi-action settings and discrete or continuous state spaces. To enable fast generalization, we learn a novel single policy network model that utilizes feature information and employs a training procedure in which arms opt in and out over time. We derive a new update rule for a crucial $\lambda$-network with theoretical convergence guarantees and empirically demonstrate the advantages of our approach on several challenging, real-world-inspired problems.

Paper Structure (29 sections, 6 theorems, 39 equations, 4 figures, 16 tables, 3 algorithms)

Introduction
Related Work
Problem Statement
Generalized Model for RMABs
Key Algorithmic Ideas
A Pretrained Model via Multi-arm Generalization
A Novel $\lambda$-network Updating Rule
Handling Continuous States with StateShaping
Inference using Pretrained Model
Experimental Evaluation
Experimental Settings
PreFeRMAB Zero-Shot Learning
PreFeRMAB Fast Fine-Tuning
Conclusion
Additional Experimental Details
...and 14 more sections

Key Result

Proposition 1

Suppose the following assumptions hold: [...] Then the generalization error over unseen arms ($\hat{\mu}$) satisfies: [...] Here, $\tilde{O}$ hides polylogarithmic factors in $n_{\mathsf{epochs}}, N$ and constants depending on $d, D, L, \beta, \frac{B}{N}, c_j, R_{\max}$, and $\lambda_{\max}$.

Figures (4)

Figure 1: Comparison of samples per arm required by DDLPO and PreFeRMAB (fine-tuning using a pretrained model) to achieve maximum DDLPO reward across different environments. PreFeRMAB achieves the maximum topline reward with significantly fewer samples than DDLPO. Averages across training seeds are reported as interquartile means.
Figure 2: Overview of the PreFeRMAB training procedure. A trained model consists of a policy network, a critic network, a $\lambda$-network, and a StateShaping module. Arm states $s_i$, features $z_i$, and opt-in decisions $\xi$ are passed through the policy network with an action-charge $\lambda$. The policy network independently predicts action probabilities for each arm, which are then greedily selected until the specified budget is reached. These selected actions are used with arm state, feature, and opt-in information to update the $\lambda$-network. Updated arm states $s'$ and rewards $r$ from the environment are then added to the buffer, and passed through the state abstraction module before being fed back through the policy network.
Figure 3: Percentage of the final DDLPO reward (the topline of Killian et al. [2022]) achieved as a function of the number of samples per arm. In DDLPO, samples are used for training from scratch; in PreFeRMAB, samples are used to fine-tune a pretrained PreFeRMAB model. Results indicate that PreFeRMAB achieves near-optimal performance zero-shot and needs only a small fraction of the samples DDLPO requires to reach final DDLPO performance.
Figure 4: Illustration for StateShaping.
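
The Figure 2 caption says that per-arm action probabilities are "greedily selected until the specified budget is reached". A minimal sketch of one way such a selection step could look is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name `greedy_budgeted_actions`, the convention that action 0 is a free passive action, the per-action cost vector, and the rule that opted-out arms always receive the passive action are all hypothetical choices made for this example.

```python
import numpy as np

def greedy_budgeted_actions(action_probs, opt_in, costs, budget):
    """Greedily assign actions to arms in order of predicted probability.

    Assumptions (hypothetical, for illustration only):
      - action_probs has shape (n_arms, n_actions); action 0 is passive and free.
      - opt_in is a boolean mask; opted-out arms always get the passive action.
      - costs[a] is the budget consumed by action a, with costs[0] == 0.
    """
    n_arms, n_actions = action_probs.shape
    actions = np.zeros(n_arms, dtype=int)  # start with everyone passive
    remaining = float(budget)

    # Collect every (probability, arm, action) candidate for opted-in arms.
    candidates = [
        (action_probs[i, a], i, a)
        for i in range(n_arms) if opt_in[i]
        for a in range(1, n_actions)
    ]
    # Highest predicted probability first.
    candidates.sort(key=lambda t: -t[0])

    for _, arm, action in candidates:
        if actions[arm] != 0:
            continue  # this arm already received a non-passive action
        if costs[action] <= remaining:
            actions[arm] = action
            remaining -= costs[action]
    return actions

# Four arms, binary actions, budget of 2; arm 2 has opted out.
probs = np.array([[0.1, 0.9], [0.4, 0.6], [0.2, 0.8], [0.7, 0.3]])
mask = np.array([True, True, False, True])
print(greedy_budgeted_actions(probs, mask, costs=np.array([0.0, 1.0]), budget=2))
```

In this toy call, arms 0 and 1 are pulled and arm 2 is skipped despite its high probability because it has opted out, so the result is `[1 1 0 0]`.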

Theorems & Definitions (10)

Proposition 1
Proposition 2
Proposition 3: Convergence of $\lambda$-network
Lemma 1
proof
Lemma 2
proof
Lemma 3
proof: Proof of the proposition on the $\lambda$-network updating rule
proof: Proof of Proposition 3 (Convergence of $\lambda$-network)
