Finite-Horizon Single-Pull Restless Bandits: An Efficient Index Policy For Scarce Resource Allocation
Guojun Xiong, Haichuan Wang, Yuqi Pan, Saptarshi Mandal, Sanket Shah, Niclas Boehmer, Milind Tambe
TL;DR
This work addresses scarce-resource allocation where each agent/arm can receive at most one intervention within a finite horizon. It introduces Finite-Horizon Single-Pull RMABs (SPRMABs) and a lightweight Single-Pull Index (SPI) policy built on a dummy-state expansion that enforces the one-pull constraint. The authors prove asymptotic optimality and a finite-sample bound on the average optimality gap, which decays as $\tilde{\mathcal{O}}(\frac{1}{\rho^{1/2}} + \frac{1}{\rho^{3/2}})$, and they validate the approach with extensive simulations across healthcare and other domains. The method is computationally efficient, does not require indexability, and substantially outperforms traditional Whittle and LP-based policies under the single-pull constraint, highlighting its practical impact for fair and efficient scarce-resource allocation.
Abstract
Restless multi-armed bandits (RMABs) have been highly successful in optimizing sequential resource allocation across many domains. However, the standard RMAB framework falls short in many practical settings with highly scarce resources, such as healthcare intervention programs, where each agent can receive at most one resource. To tackle such scenarios, we introduce Finite-Horizon Single-Pull RMABs (SPRMABs), a novel variant in which each arm can be pulled at most once. This single-pull constraint introduces additional complexity, rendering many existing RMAB solutions suboptimal or ineffective. To address this shortcoming, we propose using dummy states that expand the system and enforce the one-pull constraint. We then design a lightweight index policy for this expanded system. For the first time, we demonstrate that our index policy achieves a sub-linearly decaying average optimality gap of $\tilde{\mathcal{O}}\left(\frac{1}{\rho^{1/2}}\right)$ for a finite number of arms, where $\rho$ is the scaling factor for each arm cluster. Extensive simulations validate the proposed method, showing robust performance across various domains compared to existing benchmarks.
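To make the dummy-state idea concrete, here is a minimal Python sketch of one way such an expansion could be built. It is an illustration under stated assumptions, not the authors' implementation: the function name `expand_with_dummy_states`, the two-state transition matrices, and the choice that an already-pulled arm follows its passive dynamics regardless of the action are all hypothetical. Each original state gets a dummy copy; pulling an arm moves it into the dummy copies, where it stays, so a second pull cannot change its evolution.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def expand_with_dummy_states(p_passive: Matrix, p_active: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (passive, active) transition matrices on a doubled state space.

    States 0..n-1 are the original ("not yet pulled") states.
    States n..2n-1 are their dummy ("already pulled") copies.
    This construction is an assumption made for illustration.
    """
    n = len(p_passive)
    q_passive = [[0.0] * (2 * n) for _ in range(2 * n)]
    q_active = [[0.0] * (2 * n) for _ in range(2 * n)]
    for s in range(n):
        for t in range(n):
            # Unpulled arm left passive: it stays among the original states.
            q_passive[s][t] = p_passive[s][t]
            # Unpulled arm pulled: active dynamics, but it lands in a dummy copy.
            q_active[s][n + t] = p_active[s][t]
            # Already-pulled arm: passive dynamics inside the dummy copies for
            # either action, so a further pull has no effect.
            q_passive[n + s][n + t] = p_passive[s][t]
            q_active[n + s][n + t] = p_passive[s][t]
    return q_passive, q_active


# Hypothetical two-state arm (state 0 = bad, state 1 = good).
P0 = [[0.9, 0.1], [0.4, 0.6]]  # passive transitions
P1 = [[0.3, 0.7], [0.1, 0.9]]  # active transitions
Q0, Q1 = expand_with_dummy_states(P0, P1)
```

In this sketch the dummy block is absorbing: no row of a dummy state places probability on an original state, which is what encodes the one-pull constraint in the expanded system. An index policy designed on the expanded system would then have no incentive to pull an arm sitting in a dummy state.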
