Model Predictive Control is almost Optimal for Heterogeneous Restless Multi-armed Bandits
Dheeraj Narasimha, Nicolas Gast
TL;DR
This work addresses the infinite-horizon, heterogeneous Restless Multi-Armed Bandit (RMAB) problem under a budget constraint by proposing a finite-horizon LP-update policy rooted in model predictive control. By solving a sequence of horizon-$\tau$ linear programs and applying randomized rounding, the policy achieves a near-optimal long-run average reward under a mild ergodicity assumption, with a proven optimality gap of $O\left(\frac{\log N}{\sqrt{N}}\right)$. The analysis introduces a dissipativity framework and a Jensen-gap concentration result to couple the finite-horizon fluid solution with the stochastic RMAB dynamics, and it extends naturally to weakly coupled MDPs. Empirically, small horizons (e.g., $\tau=4$ or $5$) suffice to achieve strong performance, often rivaling or outperforming LP-priority methods, which illustrates a practical, scalable approach for constrained MDPs with heterogeneous components.
Abstract
We consider a general infinite-horizon heterogeneous restless multi-armed bandit (RMAB). Heterogeneity is a fundamental problem for many real-world systems, largely because it resists many concentration arguments. In this paper, we assume that each of the $N$ arms can have different model parameters. We show that, under a mild assumption of uniform ergodicity, a natural finite-horizon LP-update policy with randomized rounding, which was originally proposed for the homogeneous case, achieves an $O(\log N\sqrt{1/N})$ optimality gap in infinite-horizon average-reward problems for fully heterogeneous RMABs. In doing so, we provide strong theoretical guarantees for a well-known algorithm that works very well in practice. The LP-update policy is a model predictive approach that computes a decision at time $t$ by planning over a time horizon $\{t, \dots, t+\tau\}$. Our simulation section demonstrates that our algorithm works extremely well even when $\tau$ is very small (e.g., $\tau = 5$), which makes it computationally efficient. Our theoretical results draw on techniques from the model predictive control literature by invoking the concept of \emph{dissipativity}, and they generalize quite easily to the more general weakly coupled heterogeneous Markov Decision Process setting. In addition, we draw a parallel between our own policy and the LP-index policy by showing that the LP-index policy corresponds to $\tau=1$. We describe where the latter's shortcomings arise and how, under our mild assumption, we are able to address them. The proof of our main theorem answers an open problem posed by (Brown et al., 2020), paving the way for several new questions on LP-update policies.
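
The randomized rounding step mentioned above can be illustrated with a minimal Python sketch. It assumes that the horizon-$\tau$ LP has already been solved and has returned, for each arm, a first-period activation probability, and that these probabilities sum to an integer budget. The function name `round_to_budget`, its arguments, and the choice of systematic sampling are assumptions made for illustration; the abstract does not specify the rounding scheme, so this is one possible scheme and not necessarily the one analyzed in the paper.

```python
import math
import random


def round_to_budget(probs, budget, rng):
    """Round fractional activation probabilities to 0/1 actions.

    Hypothetical helper (not from the paper): uses systematic sampling so
    that arm n is activated with probability probs[n] and the number of
    activated arms equals `budget`.

    probs  : list of floats in [0, 1], assumed to sum to `budget`
    budget : non-negative integer number of arms to activate
    rng    : random.Random instance
    """
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("each probability must lie in [0, 1]")
    if abs(sum(probs) - budget) > 1e-9:
        raise ValueError("probabilities must sum to the budget")

    u = rng.random()   # one shared uniform offset in [0, 1)
    actions = []
    cum = 0.0          # running sum of probabilities
    prev_edge = 0      # floor(0 + u) == 0 because u < 1
    last = len(probs) - 1
    for i, p in enumerate(probs):
        # Cap the running sum so rounding error never overshoots the budget.
        cum = min(cum + p, float(budget))
        if i == last:
            cum = float(budget)
        edge = math.floor(cum + u)
        # Arm i is activated when an integer falls in (cum_prev + u, cum + u].
        actions.append(edge - prev_edge)
        prev_edge = edge
    return actions


if __name__ == "__main__":
    # Illustrative probabilities; in an LP-update policy these would come
    # from the first period of the horizon-tau LP solution.
    example_probs = [0.5, 0.5, 1.0, 0.0, 0.25, 0.75]
    print(round_to_budget(example_probs, budget=3, rng=random.Random(0)))
```

In a receding-horizon loop, this rounding would be applied at every time step to the first-period solution of a freshly solved LP, after which the arms transition and the LP is solved again from the new states.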
