Global Rewards in Restless Multi-Armed Bandits
Naveen Raman, Zheyuan Ryan Shi, Fei Fang
TL;DR
The paper tackles the limitation of separable rewards in restless multi-armed bandits by introducing RMAB-G, which incorporates a global non-separable reward $R_{\mathrm{glob}}$ that can model complex objectives. It develops two index-based approaches, Linear-Whittle and Shapley-Whittle, to approximate the global reward, and proves approximation bounds under certain conditions, while highlighting potential failures for highly non-linear rewards. To overcome these limitations, the authors design adaptive policies (Iterative Linear-Whittle, Iterative Shapley-Whittle, and MCTS-based variants) that account for present rewards and plan with deeper lookahead. In experiments on synthetic data and real-world food rescue data, these adaptive policies outperform baselines and pre-computed index policies, achieving near-optimal performance in small-scale settings and substantial gains in larger, realistic scenarios. The work advances RMAB research by enabling effective decision-making in problems with global, non-separable rewards and demonstrates practical applicability to resource allocation in food rescue networks.
Abstract
Restless multi-armed bandits (RMABs) extend multi-armed bandits so that pulling an arm impacts future states. Despite the success of RMABs, a key limiting assumption is the separability of rewards into a sum across arms. We address this deficiency by proposing restless multi-armed bandits with global rewards (RMAB-G), a generalization of RMABs to global non-separable rewards. To solve RMAB-G, we develop the Linear- and Shapley-Whittle indices, which extend Whittle indices from RMABs to RMAB-Gs. We prove approximation bounds but also point out how these indices could fail when reward functions are highly non-linear. To overcome this, we propose two sets of adaptive policies: the first computes indices iteratively, and the second combines indices with Monte-Carlo Tree Search (MCTS). Empirically, we demonstrate that our proposed policies outperform baselines and index-based policies on synthetic data and real-world data from food rescue.
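
To make the separability assumption concrete, the following minimal Python sketch contrasts a reward that is a sum across arms with one that is not. It is an illustration only, not the paper's formulation: the probabilities in `p`, the "at least one pulled arm succeeds" reward, and the helper `is_additive` are hypothetical choices made for this example, and arm states and transitions are omitted entirely.

```python
from itertools import combinations

# Hypothetical per-arm success probabilities (illustrative values, not from the paper).
p = [0.5, 0.5, 0.2]


def separable_reward(pulled):
    """Separable reward: a sum of independent per-arm terms."""
    return sum(p[i] for i in pulled)


def global_reward(pulled):
    """Non-separable example: probability that at least one pulled arm succeeds."""
    miss = 1.0
    for i in pulled:
        miss *= 1.0 - p[i]
    return 1.0 - miss


def is_additive(reward, n_arms):
    """Return True if reward(S) equals the sum of single-arm rewards for every subset S."""
    for k in range(n_arms + 1):
        for subset in combinations(range(n_arms), k):
            total = sum(reward((i,)) for i in subset)
            if abs(reward(subset) - total) > 1e-9:
                return False
    return True


print(is_additive(separable_reward, len(p)))  # the sum-based reward decomposes across arms
print(is_additive(global_reward, len(p)))     # the "at least one success" reward does not
```

In this toy setting, pulling arms 0 and 1 together yields 0.75 under the global reward, whereas the two single-arm rewards sum to 1.0. The value of pulling an arm therefore depends on which other arms are pulled, which is the kind of interaction that per-arm indices cannot capture directly.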
