Restless Bandits with Average Reward: Breaking the Uniform Global Attractor Assumption

Yige Hong, Qiaomin Xie, Yudong Chen, Weina Wang

TL;DR

This work tackles infinite-horizon restless bandits with average reward, seeking policies whose performance gap vanishes as the number of arms grows. It introduces Follow-the-Virtual-Advice (FTVA) and its continuous-time variant to convert any single-armed policy into an $N$-armed policy, achieving an $O(1/\sqrt{N})$ optimality gap without relying on the Uniform Global Attractor Property (UGAP). In the discrete-time setting, the result holds under a Synchronization Assumption (SA), while in continuous time the bound holds under the standard unichain condition with no additional assumptions. The approach reduces computational complexity to solving a single-armed LP and enables linear-in-$N$ implementation, with extensions to heterogeneous arms and a clear path toward UGAP-free asymptotic optimality in restless bandits.
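For orientation, the single-armed LP mentioned above is usually written as the standard occupancy-measure relaxation of an average-reward restless bandit. The form below is a sketch under that assumption, not a quotation from the paper: the variable $y(s,a)$ (the long-run fraction of time the arm spends in state $s$ taking action $a$) and the action labels $a \in \{0,1\}$ with $a=1$ meaning "activate" are notation introduced here for illustration.

$$
\begin{aligned}
\max_{y \ge 0} \quad & \sum_{s\in\mathbb{S}}\sum_{a\in\mathbb{A}} r(s,a)\, y(s,a) \\
\text{s.t.} \quad & \sum_{s\in\mathbb{S}} y(s,1) = \alpha, \\
& \sum_{a\in\mathbb{A}} y(s',a) = \sum_{s\in\mathbb{S}}\sum_{a\in\mathbb{A}} y(s,a)\, P(s,a,s') \quad \text{for all } s'\in\mathbb{S}, \\
& \sum_{s\in\mathbb{S}}\sum_{a\in\mathbb{A}} y(s,a) = 1.
\end{aligned}
$$

Under this reading, a single-armed policy is recovered from a solution by activating in state $s$ with probability $y(s,1)/\big(y(s,0)+y(s,1)\big)$ whenever the denominator is positive. The LP has $|\mathbb{S}||\mathbb{A}|$ variables, so its size does not depend on $N$.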

Abstract

We study the infinite-horizon restless bandit problem with the average reward criterion, in both discrete-time and continuous-time settings. A fundamental goal is to efficiently compute policies that achieve a diminishing optimality gap as the number of arms, $N$, grows large. Existing results on asymptotic optimality all rely on the uniform global attractor property (UGAP), a complex and challenging-to-verify assumption. In this paper, we propose a general, simulation-based framework, Follow-the-Virtual-Advice, that converts any single-armed policy into a policy for the original $N$-armed problem. This is done by simulating the single-armed policy on each arm and carefully steering the real state towards the simulated state. Our framework can be instantiated to produce a policy with an $O(1/\sqrt{N})$ optimality gap. In the discrete-time setting, our result holds under a simpler synchronization assumption, which covers some problem instances that violate UGAP. More notably, in the continuous-time setting, we do not require \emph{any} additional assumptions beyond the standard unichain condition. In both settings, our work is the first asymptotic optimality result that does not require UGAP.
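The simulate-and-steer idea can be sketched for one discrete time step as follows. This is a minimal illustration of one plausible reading of the abstract, not the paper's algorithm: the function name `ftva_step`, the priority rule (arms whose real state already equals their virtual state are the first to follow the advice), and the coupling rule (a synchronized arm that follows its advice copies the virtual transition) are all assumptions made for this example.

```python
import random

def ftva_step(real, virtual, policy, P, budget, rng):
    """One step of a Follow-the-Virtual-Advice-style update (illustrative sketch).

    real, virtual : lists of real / virtual states, one entry per arm
    policy[s]     : probability that the single-armed policy activates in state s
    P[s][a]       : list of next-state probabilities from state s under action a
    budget        : exact number of arms to activate this step (alpha * N)
    """
    n = len(real)
    # 1. Virtual advice: run the single-armed policy on each arm's virtual state.
    advice = [1 if rng.random() < policy[virtual[i]] else 0 for i in range(n)]

    # 2. Order arms so that synchronized ("good") arms come first (assumed rule).
    order = sorted(range(n), key=lambda i: real[i] != virtual[i])
    want_on = [i for i in order if advice[i] == 1]
    want_off = [i for i in order if advice[i] == 0]
    if len(want_on) >= budget:
        # Too many arms advised "on": good arms get to follow the advice first.
        active = set(want_on[:budget])
    else:
        # Too few: force extra activations on the lowest-priority "off" arms.
        need = budget - len(want_on)
        active = set(want_on) | set(want_off[len(want_off) - need:])
    actions = [1 if i in active else 0 for i in range(n)]

    # 3. Transitions: the virtual state always follows the advice; the real
    #    state copies it when the arm was synchronized and followed the advice.
    states = range(len(P))
    new_real, new_virtual = [], []
    for i in range(n):
        v_next = rng.choices(states, weights=P[virtual[i]][advice[i]])[0]
        if real[i] == virtual[i] and actions[i] == advice[i]:
            r_next = v_next
        else:
            r_next = rng.choices(states, weights=P[real[i]][actions[i]])[0]
        new_real.append(r_next)
        new_virtual.append(v_next)
    return new_real, new_virtual, actions, advice
```

In this sketch the budget is met exactly at every step, and an arm whose real and virtual states agree stays synchronized for as long as it is allowed to follow its advice. Each step costs one pass over the arms plus a sort; a two-bucket partition in place of the sort would make the step linear in $N$.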

Paper Structure

This paper contains 53 sections, 20 theorems, 75 equations, 8 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Consider an $N$-armed RB problem $(N, \mathbb{S}^N, \mathbb{A}^N, P, r, \alpha N)$ under the single-armed unichain assumption. Let ${\bar{\pi}}$ be any single-armed policy satisfying SA. For any $N\ge 1$, the conversion loss of FTVA satisfies the upper bound where $r_{\max} \triangleq \max_{s\in\mathbb{S},a\in\mathbb{A}}|r(s,a)|$ and $\overline{\tau}^{\textup{sync}}_{\max} \triangleq \max_{(s, a,

Figures (8)

  • Figure 1: A discrete-time RB problem that satisfies SA but not UGAP.
  • Figure 2: Time evolution of the fraction of arms in each state under LP-Priority (upper), or after switching to $\textup{FTVA}({\bar{\pi}}^*)$ at time slot $250$ (lower). The x-axis represents the time slot, which ranges from $250$ to $289$; the y-axis represents the states; the color represents the fraction of arms in each state at each time slot. The colors and magnitudes of the arrows represent the average directions and rates at which the arms move away from each state.
  • Figure 3: Comparing policies based on virtual states and real states
  • Figure 4: Illustration of the positive-probability sample paths that lead to synchronization in Proposition \ref{prop:self-loop-two-states} (left) and Proposition \ref{prop:self-loop-one-state} (right). In each figure, the dotted arrows correspond to the sample path of the leader arm's state, while the solid arrows correspond to the sample path of the follower arm's state. The numbers near the arrows denote the temporal order of the transition events.
  • Figure 5: Illustration of the positive-probability sample paths that lead to synchronization in Proposition \ref{prop:two-cycles-imply-synchronization} (left) and Proposition \ref{prop:one-cycle-imply-synchronization} (right). In each figure, the dotted arrows correspond to the sample path of the leader arm's state, while the solid arrows correspond to the sample path of the follower arm's state. The numbers near the arrows denote the temporal order of the transition events.
  • ...and 3 more figures

Theorems & Definitions (42)

  • Theorem 1
  • Theorem 2
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Remark 1
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • ...and 32 more