Achieving $\tilde{\mathcal{O}}(1/N)$ Optimality Gap in Restless Bandits through Gaussian Approximation

Chen Yan, Weina Wang, Lei Ying

TL;DR

This work tackles finite-horizon Restless Multi-Armed Bandits (RMABs) with $N$ homogeneous arms, where standard fluid (LP-based) policies can incur a $\Theta(1/\sqrt{N})$ per-arm optimality gap in degenerate settings. The authors introduce a Gaussian stochastic system that augments the fluid approximation by capturing both the mean and the variance of the dynamics around the fluid optimum $\mathbf{y}^*$, and they solve a Gaussian stochastic program (SP) within a $\tilde{\Theta}(1/\sqrt{N})$-neighborhood of $\mathbf{y}^*$ to derive an SP-based policy. Under a Uniqueness Assumption, this SP-based policy achieves a global optimality gap of $\tilde{\mathcal{O}}(1/N)$, improving on LP-based approaches, which exhibit $\Theta(1/\sqrt{N})$ gaps; the paper also proves that, without the Uniqueness Assumption, the SP-based approach still improves on LP-based policies. The theoretical results are complemented by numerical experiments on machine-maintenance RMABs, which show that the SP-based policy yields substantial gains over LP-based policies as $N$ grows, with computational methods (SAA/EDDP) scaling linearly in the horizon and the state space. Overall, the work provides a principled, scalable route to near-optimal policies for degenerate RMABs and highlights the value of variance-aware Gaussian approximations in stochastic decision problems.

Abstract

We study the finite-horizon Restless Multi-Armed Bandit (RMAB) problem with $N$ homogeneous arms. Prior work has shown that when an RMAB satisfies a non-degeneracy condition, Linear-Programming-based (LP-based) policies derived from the fluid approximation, which captures the mean dynamics of the system, achieve an exponentially small optimality gap. However, it is common for RMABs to be degenerate, in which case LP-based policies can result in a $\Theta(1/\sqrt{N})$ optimality gap per arm. In this paper, we propose a novel Stochastic-Programming-based (SP-based) policy that, under a uniqueness assumption, achieves an $\tilde{\mathcal{O}}(1/N)$ optimality gap for degenerate RMABs. Our approach is based on the construction of a Gaussian stochastic system that captures not only the mean but also the variance of the RMAB dynamics, resulting in a more accurate approximation than the fluid approximation. We then solve a stochastic program for this system to obtain our policy. This is the first result to establish an $\tilde{\mathcal{O}}(1/N)$ optimality gap for degenerate RMABs.
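To make the contrast between the fluid (mean-only) and Gaussian (mean plus variance) approximations concrete, here is a minimal Python sketch. It is not the paper's construction: it ignores actions and budget constraints, and the two-state transition matrix `P`, the initial fractions `x`, and all function names are hypothetical. It only illustrates that, for $N$ homogeneous arms, the empirical state fractions after one step deviate from the fluid mean by an amount of order $1/\sqrt{N}$, with a covariance that can be computed in closed form.

```python
import numpy as np

def fluid_step(x, P):
    """Mean (fluid) update of the state fractions: x_next = x @ P."""
    return np.asarray(x, dtype=float) @ np.asarray(P, dtype=float)

def gaussian_covariance(x, P):
    """Covariance of sqrt(N) * (X_next - fluid mean) for independent arms.

    Each arm in state s moves according to row P[s], so the next-step counts
    are a sum of independent multinomials, giving
    sum_s x[s] * (diag(P[s]) - P[s] P[s]^T).
    """
    x = np.asarray(x, dtype=float)
    P = np.asarray(P, dtype=float)
    cov = np.zeros((len(x), len(x)))
    for s, frac in enumerate(x):
        cov += frac * (np.diag(P[s]) - np.outer(P[s], P[s]))
    return cov

def sample_step(n_arms, x, P, rng):
    """Simulate one step of the n_arms-system and return the state fractions."""
    P = np.asarray(P, dtype=float)
    counts = np.rint(n_arms * np.asarray(x, dtype=float)).astype(int)
    nxt = np.zeros(len(counts))
    for s, c in enumerate(counts):
        nxt += rng.multinomial(c, P[s])  # arms leaving state s
    return nxt / n_arms

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = np.array([[0.9, 0.1], [0.3, 0.7]])  # hypothetical transition matrix
    x = np.array([0.5, 0.5])                # hypothetical initial fractions
    for n in (100, 10_000):
        draws = np.array([sample_step(n, x, P, rng) for _ in range(2000)])
        predicted = np.sqrt(gaussian_covariance(x, P)[0, 0] / n)
        print(n, draws[:, 0].std(), predicted)
```

Running the script prints, for each $N$, the empirical standard deviation of the first state fraction next to the value predicted by the Gaussian covariance. The fluid update alone would report only the mean and miss this $1/\sqrt{N}$-scale spread.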

Paper Structure

This paper contains 65 sections, 14 theorems, 220 equations, 5 figures, 3 tables, 1 algorithm.

Key Result

Theorem 4.1

Consider an RMAB that satisfies the Uniqueness Assumption. Then the locally-SP-optimal policy, $\tilde{\pi}^{N,*}$, when applied to the $N$-system (with rounding), achieves an optimality gap of $\tilde{\mathcal{O}}(1/N)$. Here the gap is measured between $V^N_{\mathrm{opt}}$, the optimal value function, and $V^N_{\tilde{\pi}^{N,*}}$, the value function of $\tilde{\pi}^{N,*}$, both in the $N$-system.
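As a sketch, the bound in Theorem 4.1 can be written in symbols as follows. The arguments of the value functions (for example, the initial state) are suppressed here, which is an assumption of this restatement rather than the paper's exact display; by the usual convention, $\tilde{\mathcal{O}}$ hides polylogarithmic factors in $N$.

```latex
V^N_{\mathrm{opt}} - V^N_{\tilde{\pi}^{N,*}} = \tilde{\mathcal{O}}\!\left(\frac{1}{N}\right)
```

For comparison, the abstract states that LP-based policies can incur a $\Theta(1/\sqrt{N})$ per-arm gap on degenerate RMABs.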

Figures (5)

  • Figure 1: Comparison of LP-based and SP-based policies on a two-state, two-step RMAB example.
  • Figure 2: Proof structure of the global optimality theorem.
  • Figure 3: Comparison of LP-based and SP-based policies on a machine maintenance example. Top row: reward per arm. Bottom row: total reward difference (SP minus LP) with 2-sigma error bars.
  • Figure 4: The computation time of EDDP.
  • Figure 5: The one-armed MDP with two states. The only non-zero reward is under action $1$ in state $1$.

Theorems & Definitions (23)

  • Definition 3.1: Non-degeneracy [zhang2021restless, gast2023linear, brown2023fluid]
  • Theorem 4.1: Global optimality
  • Theorem 4.2: Fluid gap
  • Theorem 4.3: Performance improvement
  • Theorem C.1: Global optimality
  • Lemma C.1: Proximity of optimal policy
  • Lemma C.2: Value difference
  • Lemma C.3: Evaluation difference
  • Proof: Proof of the global optimality theorem
  • Claim C.1
  • ...and 13 more