Model Predictive Control is almost Optimal for Heterogeneous Restless Multi-armed Bandits

Dheeraj Narasimha, Nicolas Gast

TL;DR

This work addresses the infinite-horizon, heterogeneous Restless Multi-Armed Bandit problem under a budget constraint by proposing a finite-horizon LP-update policy rooted in model predictive control. By solving a sequence of horizon-$\tau$ linear programs and applying randomized rounding, the policy yields a near-optimal long-run average reward under a mild ergodicity assumption, with a proven gap bound of $O\left(\frac{\log N}{\sqrt{N}}\right)$. The analysis introduces a dissipativity framework and a Jensen-gap concentration result to couple the finite-horizon fluid solution with the stochastic RMAB dynamics, and extends naturally to weakly coupled MDPs. Empirically, small horizons (e.g., $\tau=4$ or $\tau=5$) suffice to achieve strong performance, often rivaling or outperforming LP-priority methods, which makes the policy a practical, scalable approach for constrained MDPs with heterogeneous components.
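The randomized rounding step mentioned above can be illustrated with a minimal sketch. It assumes the LP returns, for each arm, a fractional activation probability, and that these probabilities sum to an integer budget. The scheme used here is systematic sampling, a standard rounding technique swapped in for illustration; the paper's own rounding procedure may differ, and the name `systematic_round` is hypothetical.

```python
import math
import random


def systematic_round(probs, rng=random):
    """Select arms so that arm n is chosen with probability probs[n].

    Assumption (not from the paper): each probs[n] lies in [0, 1] and
    sum(probs) is an integer budget B, so exactly B arms are returned
    (up to floating-point drift in the running sum).
    """
    u = rng.random()  # one shared uniform offset in [0, 1)
    chosen = []
    cum = 0.0
    for n, p in enumerate(probs):
        lo, hi = cum, cum + p
        # Arm n is selected when a grid point u + k (k integer)
        # falls inside the interval (lo, hi].
        if math.floor(hi - u) > math.floor(lo - u):
            chosen.append(n)
        cum = hi
    return chosen


if __name__ == "__main__":
    # Fractional activations for six arms; they sum to a budget of 3.
    fractional = [0.5, 0.5, 0.25, 0.75, 1.0, 0.0]
    print(systematic_round(fractional, random.Random(0)))
```

Because a single uniform draw is shared across all arms, the budget is met on every sample path rather than only in expectation, which is the property a hard budget constraint needs.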

Abstract

We consider a general infinite-horizon heterogeneous restless multi-armed bandit (RMAB). Heterogeneity is a fundamental problem for many real-world systems, largely because it resists many concentration arguments. In this paper, we assume that each of the $N$ arms can have different model parameters. We show that, under a mild assumption of uniform ergodicity, a natural finite-horizon LP-update policy with randomized rounding, which was originally proposed for the homogeneous case, achieves an $O(\log N\sqrt{1/N})$ optimality gap in infinite-horizon average-reward problems for fully heterogeneous RMABs. In doing so, we provide strong theoretical guarantees for a well-known algorithm that works very well in practice. The LP-update policy is a model predictive approach that computes a decision at time $t$ by planning over a time horizon $\{t, \dots, t+\tau\}$. Our simulation section demonstrates that our algorithm works extremely well even when $\tau$ is very small and set to $5$, which makes it computationally efficient. Our theoretical results draw on techniques from the model predictive control literature by invoking the concept of dissipativity, and they generalize quite easily to the more general weakly coupled heterogeneous Markov Decision Process setting. In addition, we draw a parallel between our own policy and the LP-index policy by showing that the LP-index policy corresponds to $\tau=1$. We describe where the latter's shortcomings come from and how, under our mild assumption, we are able to address them. The proof of our main theorem answers an open problem posed by (Brown et al., 2020), paving the way for several new questions on LP-update policies.
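For orientation, a finite-horizon fluid LP of the kind the abstract describes might be written as follows. This is a sketch under assumptions, not the paper's exact formulation: the occupation variables $y_n(i,a,s)$, rewards $r_n(i,a)$, transition kernels $P_n(j \mid i,a)$, current state distributions $x_n(i,t)$, and the budget fraction $\alpha$ are notation introduced here for illustration, and the planning window $\mathcal{T}_t = \{t, \dots, t+\tau\}$ follows the abstract.

```latex
\begin{aligned}
\max_{y \ge 0}\quad
  & \sum_{s \in \mathcal{T}_t} \sum_{n=1}^{N} \sum_{i,a} r_n(i,a)\, y_n(i,a,s) \\
\text{s.t.}\quad
  & \sum_{n=1}^{N} \sum_{i} y_n(i,1,s) \le \alpha N,
  && \forall s \in \mathcal{T}_t, \\
  & \sum_{a} y_n(i,a,t) = x_n(i,t),
  && \forall n,\ i, \\
  & \sum_{a} y_n(j,a,s+1) = \sum_{i,a} y_n(i,a,s)\, P_n(j \mid i,a),
  && \forall n,\ j,\ \text{and } s,\, s+1 \in \mathcal{T}_t .
\end{aligned}
```

In a model predictive scheme of this type, only the first-step values $y_n(\cdot,\cdot,t)$ would be rounded into actual arm activations; the LP is then solved again at time $t+1$ from the newly observed states.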

Paper Structure

This paper contains 32 sections, 21 theorems, 112 equations, 4 figures, 1 algorithm.

Key Result

Lemma 1

The gain $\mathbf{g}^{\star}$ is an upper bound on the optimal value $\mathbf{V}_{\text{OPT}}$, that is, $\mathbf{V}_{\text{OPT}} \leq \mathbf{g}^{\star}$.

Figures (4)

  • Figure 1: Comparison of the normalized average reward for the three policies on random examples. We study the influence of various parameters ($\alpha$, $N$ and $\tau$).
  • Figure 2: Comparison of the normalized average reward of the three policies for the three counterexamples to the LP-priority policy.
  • Figure 3: Comparison of the normalized average reward as a function of $\tau$, the time horizon used in the LP, for the three counterexamples. Recall that the LP-priority and the ID reassignment policies do not depend on $\tau$ (which is why their performance is a straight line as a function of $\tau$).
  • Figure 4: A diagram of the relationships between the state and policy variables constructed for the proof, over one time step.

Theorems & Definitions (43)

  • Definition 1
  • Lemma 1
  • Definition 2
  • Definition 3
  • Remark 1
  • Lemma 2
  • Remark 2
  • Theorem 1
  • Lemma 3
  • Lemma 4
  • ...and 33 more