Global Rewards in Restless Multi-Armed Bandits

Naveen Raman, Zheyuan Ryan Shi, Fei Fang

TL;DR

The paper tackles the limitation of separable rewards in restless multi-armed bandits by introducing RMAB-G, which incorporates a global non-separable reward $R_{\mathrm{glob}}$ that can model complex objectives. It develops two index-based approaches, Linear-Whittle and Shapley-Whittle, to approximate the global reward, and proves approximation bounds under certain conditions, while highlighting potential failures for highly non-linear rewards. To overcome these limitations, the authors design adaptive policies (Iterative Linear-Whittle, Iterative Shapley-Whittle, and MCTS-based variants) that incorporate present rewards and plan with deeper lookahead. In experiments on synthetic data and real-world food rescue data, these adaptive policies consistently outperform baselines and pre-computed index policies, achieving near-optimal performance in small-scale settings and substantial gains in larger, realistic scenarios. The work advances RMAB research by enabling effective decision-making in problems with global, non-separable rewards and demonstrates practical applicability to resource allocation in food rescue networks.

Abstract

Restless multi-armed bandits (RMABs) extend multi-armed bandits so that pulling an arm impacts future states. Despite the success of RMABs, a key limiting assumption is the separability of rewards into a sum across arms. We address this deficiency by proposing the restless multi-armed bandit with global rewards (RMAB-G), a generalization of RMABs to global non-separable rewards. To solve RMAB-G, we develop the Linear- and Shapley-Whittle indices, which extend Whittle indices from RMABs to RMAB-Gs. We prove approximation bounds but also point out how these indices could fail when reward functions are highly non-linear. To overcome this, we propose two sets of adaptive policies: the first computes indices iteratively, and the second combines indices with Monte-Carlo Tree Search (MCTS). Empirically, we demonstrate that our proposed policies outperform baselines and index-based policies with synthetic data and real-world data from food rescue.
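
To make the separable versus non-separable distinction concrete, the following is a minimal sketch of the standard permutation-sampling Shapley estimator applied to a toy reward. It is not the authors' implementation. It assumes, for illustration only, that the global reward depends solely on the set of pulled arms (the paper's reward also depends on arm states), and the names `shapley_estimate`, `max_reward`, `sum_reward`, and `VALUES` are hypothetical.

```python
import random
from typing import Callable, FrozenSet, List

# Hypothetical per-arm values used only for this illustration.
VALUES = [1.0, 2.0, 3.0]


def max_reward(pulled: FrozenSet[int]) -> float:
    """A non-separable toy reward: the best value among pulled arms."""
    return max((VALUES[i] for i in pulled), default=0.0)


def sum_reward(pulled: FrozenSet[int]) -> float:
    """A separable toy reward: the sum of values over pulled arms."""
    return sum(VALUES[i] for i in pulled)


def shapley_estimate(
    reward: Callable[[FrozenSet[int]], float],
    n_arms: int,
    n_samples: int,
    seed: int = 0,
) -> List[float]:
    """Estimate each arm's Shapley value by averaging its marginal
    contribution over randomly sampled orderings of the arms."""
    rng = random.Random(seed)
    totals = [0.0] * n_arms
    order = list(range(n_arms))
    for _ in range(n_samples):
        rng.shuffle(order)
        pulled: set = set()
        previous = reward(frozenset(pulled))
        for arm in order:
            pulled.add(arm)
            current = reward(frozenset(pulled))
            totals[arm] += current - previous  # marginal contribution
            previous = current
    return [total / n_samples for total in totals]


max_shapley = shapley_estimate(max_reward, n_arms=3, n_samples=2000)
sum_shapley = shapley_estimate(sum_reward, n_arms=3, n_samples=2000)
```

For the separable reward, each arm's estimate equals its own value, so a per-arm decomposition loses nothing. For the max reward, the estimates still sum to the reward of pulling all arms, but each arm's share depends on which other arms are pulled, which is the kind of interaction a separable model cannot express.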

Paper Structure

This paper contains 32 sections, 11 theorems, 28 equations, 12 figures, and 1 table.

Key Result

Theorem 1

The restless multi-armed bandit with global rewards is PSPACE-hard.

Figures (12)

  • Figure 1: We compare baselines to our index-based and adaptive policies across four reward functions. All of our policies outperform baselines. Across all rewards, our best policy is within 3% of optimal for $N=4$. Among our policies, Iterative and MCTS Shapley-Whittle consistently perform best.
  • Figure 2: We plot policy performance for instances of the Subset reward which vary in linearity. We see that the Iterative and MCTS Shapley-Whittle policies outperform alternatives for non-linear rewards (small $\theta_{\mathrm{linear}}$) while policy performances converge for linear rewards (large $\theta_{\mathrm{linear}}$).
  • Figure 3: We compare the efficiency and performance of Iterative Shapley-Whittle methods which vary in Monte Carlo samples used for Shapley estimation and MCTS Shapley-Whittle methods which vary in MCTS iterations. While Iterative Shapley-Whittle is the slowest, decreasing the number of Monte Carlo samples can improve efficiency without impacting performance.
  • Figure 4: We plot the (a) transition probabilities obtained by clustering volunteers with various levels of experience and (b) the distribution of match probabilities across all volunteers. We see that as volunteers gain more experience, they are more likely to become engaged when notified. However, we see that most volunteers have a low probability of responding, which motivates the need for large budgets in notification.
  • Figure 5: We compare the performance of vanilla MCTS algorithms when varying (a) depth of exploration and (b) the number of test iterations. We find that, no matter the choice of depth or test iterations, MCTS performs worse than Linear-Whittle. Moreover, as expected, increasing test iterations improves performance, while lower depth improves MCTS performance due to the difficulty in estimating rewards from transitions.
  • ...and 7 more figures

Theorems & Definitions (29)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Definition 2
  • Definition 3
  • Theorem 4
  • Theorem 5
  • Corollary 5.1
  • Example 4.1
  • ...and 19 more