From Restless to Contextual: A Thresholding Bandit Reformulation For Finite-horizon Performance

Jiamin Xu, Ivan Nazarov, Aditya Rastogi, África Periáñez, Kyra Gan

TL;DR

This work tackles the poor finite-horizon performance of online restless bandit (RB) algorithms by reformulating the RB as a budgeted thresholding contextual bandit (BT-CMAB). By embedding long-term dynamics into a contextual reward $\phi^m(s,a)$ and applying a threshold $\gamma$ to the incremental gains $I^m(s)$, the method avoids explicit MDP estimation and learns quickly from few samples. The authors establish a non-asymptotic optimality result for a two-state homogeneous RB under the BT-CMAB reduction and propose an epsilon-Greedy Thresholding algorithm with sublinear regret, supported by a concentration-based regret analysis. Empirically, the approach converges faster and earns higher cumulative reward than state-of-the-art online RB methods in large-scale heterogeneous environments under finite horizons.
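The thresholding step described above can be illustrated with a small sketch. It is not the authors' algorithm: it assumes binary actions, assumes the incremental gain is $I^m(s) = \phi^m(s,1) - \phi^m(s,0)$, and assumes that the exploit step spends the budget on the largest above-threshold gains. The names `incremental_gain`, `select_agents`, and `phi_hat` are hypothetical, and the update of the reward estimates is omitted.

```python
import random


def incremental_gain(phi, m, s):
    """Assumed definition: estimated gain of acting (a=1) over not acting (a=0)."""
    return phi[(m, s, 1)] - phi[(m, s, 0)]


def select_agents(phi_hat, states, budget, gamma, epsilon, rng):
    """Choose at most `budget` agents to act on in one round (illustrative only).

    phi_hat: dict mapping (agent, state, action) -> estimated contextual reward
    states:  list holding the current state of each agent
    rng:     a random.Random instance
    """
    agents = list(range(len(states)))
    if rng.random() < epsilon:
        # Explore: a uniformly random subset of agents within the budget.
        return sorted(rng.sample(agents, min(budget, len(agents))))
    # Exploit: keep agents whose estimated gain clears the threshold gamma,
    # then spend the budget on the largest gains among them (an assumption).
    gains = {m: incremental_gain(phi_hat, m, states[m]) for m in agents}
    good = [m for m in agents if gains[m] >= gamma]
    good.sort(key=lambda m: gains[m], reverse=True)
    return sorted(good[:budget])
```

In this sketch, an agent whose estimated gain falls below $\gamma$ is never selected during exploitation, even if budget remains, which is the sense in which the rule thresholds rather than simply ranks.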

Abstract

This paper addresses the poor finite-horizon performance of existing online restless bandit (RB) algorithms, which stems from the prohibitive sample complexity of learning a full Markov decision process (MDP) for each agent. We argue that superior finite-horizon performance requires rapid convergence to a high-quality policy. Thus motivated, we introduce a reformulation of online RBs as a budgeted thresholding contextual bandit, which simplifies the learning problem by encoding long-term state transitions into a scalar reward. We prove the first non-asymptotic optimality of an oracle policy for a simplified finite-horizon setting. We propose a practical learning policy for a heterogeneous-agent, multi-state setting and show that it achieves sublinear regret and converges faster than existing methods. This directly translates to higher cumulative reward, as empirically validated by significant gains over state-of-the-art algorithms in large-scale heterogeneous environments. Our work provides a new pathway for achieving practical, sample-efficient learning in finite-horizon RBs.

Paper Structure

This paper contains 36 sections, 9 theorems, 111 equations, 11 figures, 1 table.

Key Result

Theorem 3.1

Assume all agents are homogeneous, i.e., $P^m=P^{m'}, R^m=R^{m'},\forall m,m'\in[M]$. Then $\bm{\pi}^{\text{greedy}}=\bm{\pi}^*$ for any $0\leq B\leq M<\infty$.

Figures (11)

  • Figure 1: Average (over time) cumulative reward for budgets $B=5$ (left), $10$ (middle), $20$ (right), averaged over 50 instances with 13 repetitions each. Top: noiseless reward; bottom: noisy reward.
  • Figure C.2: Dynamic of Agent Type 1 with Intervention
  • Figure C.3: Dynamic of Agent Type 1 without Intervention
  • Figure C.4: Dynamic of Agent Type 2 with Intervention
  • Figure C.5: Dynamic of Agent Type 2 without Intervention
  • ...and 6 more figures

Theorems & Definitions (17)

  • Definition 2.2: Regret as Convergence Rate
  • Remark 2.3
  • Theorem 3.1
  • Remark 3.2: Difficulty of Relaxation
  • Lemma 4.2: Existence of Sufficient Good Agent-State Pairs
  • Theorem 4.3: Regret Upper Bound
  • Remark 4.4: How to Select the Threshold $\gamma$?
  • Theorem B.1: Theorem 2 of Weber and Weiss (1990)
  • Theorem E.1
  • proof
  • ...and 7 more