Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback

Guojun Xiong, Jian Li

TL;DR

This work tackles adversarial RMAB with unknown transition dynamics and bandit feedback, introducing UCMD-ARMAB to learn under an instantaneous activation constraint. The algorithm combines four components: confidence-set-based transition estimation, a time-dependent occupancy-measure formulation solved by Online Mirror Descent (OMD), a biased adversarial reward estimator for bandit feedback, and a low-complexity Reward Maximizing Index (RMI) policy that enforces feasibility. The authors prove a regret bound of $\tilde{\mathcal{O}}(H\sqrt{T})$ (up to problem constants) and report empirical gains over a stochastic baseline in realistic RMAB-like tasks, indicating practical viability in non-stationary, adversarial environments. The results advance the theoretical and practical understanding of learning in adversarial RMAB with unknown transitions and limited feedback, enabling provably efficient decision-making under tight instantaneous activation constraints.

Abstract

Restless multi-armed bandits (RMAB) play a central role in modeling sequential decision-making problems under an instantaneous activation constraint: at most $B$ arms can be activated at any decision epoch. Each restless arm is endowed with a state that evolves independently according to a Markov decision process, regardless of whether the arm is activated. In this paper, we consider the task of learning in episodic RMAB with unknown transition functions and adversarial rewards, which can change arbitrarily across episodes. Further, we consider a challenging but natural bandit feedback setting in which only the adversarial rewards of activated arms are revealed to the decision maker (DM). The goal of the DM is to maximize its total adversarial reward during the learning process while satisfying the instantaneous activation constraint in each decision epoch. We develop a novel reinforcement learning algorithm with two key components: a novel biased adversarial reward estimator to deal with bandit feedback and unknown transitions, and a low-complexity index policy to satisfy the instantaneous activation constraint. We show an $\tilde{\mathcal{O}}(H\sqrt{T})$ regret bound for our algorithm, where $T$ is the number of episodes and $H$ is the episode length. To the best of our knowledge, this is the first algorithm to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret for adversarial RMAB in our considered challenging settings.
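
The instantaneous activation constraint can be illustrated with a minimal sketch: given one index value per arm, activate the (at most) $B$ arms with the largest index. This is only an illustration of how an index policy keeps every decision epoch feasible. The function name `activate_top_b` and the index values are hypothetical, and the paper's actual index is defined from its occupancy measures, which are not reproduced here.

```python
import numpy as np

def activate_top_b(indices, budget):
    """Return a 0/1 action vector that activates at most `budget` arms.

    `indices` holds one (hypothetical) index value per arm; larger means
    higher priority. Ties are broken in favor of the lower arm number.
    """
    indices = np.asarray(indices, dtype=float)
    num_arms = indices.shape[0]
    b = max(0, min(budget, num_arms))      # never activate more arms than exist
    actions = np.zeros(num_arms, dtype=int)
    if b > 0:
        # Stable sort on negated indices gives a descending order.
        top = np.argsort(-indices, kind="stable")[:b]
        actions[top] = 1
    return actions

# Example: four arms, budget B = 2.
example_actions = activate_top_b([0.2, 0.9, 0.5, 0.7], budget=2)
```

Because the selection is a hard top-$B$ cut, the constraint holds by construction at every epoch, whatever the index values are.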

Paper Structure

This paper contains 24 sections, 20 theorems, 70 equations, 7 figures, 1 table, 2 algorithms.

Key Result

Lemma 4.1

With probability at least $1-2\epsilon$, the true transition functions are within the confidence sets, i.e., $P_n\in {\mathcal{P}}_n^{t}$, $\forall n\in[N], t\in[T].$
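
A minimal sketch of how such a confidence set might be formed for a single arm, assuming an empirical transition estimate and a Hoeffding-style radius. The radius below is an assumed generic form, not the paper's exact expression, and the helper names (`empirical_transitions`, `confidence_radius`, `in_confidence_set`) are hypothetical.

```python
import numpy as np

def empirical_transitions(counts):
    """counts[s, a, s2] = number of observed transitions (s, a) -> s2."""
    visits = counts.sum(axis=-1)                         # visit count per (s, a)
    p_hat = counts / np.maximum(visits, 1)[..., None]    # avoid division by zero
    return p_hat, visits

def confidence_radius(visits, epsilon):
    """Assumed Hoeffding-style radius; it shrinks as visit counts grow."""
    return np.sqrt(np.log(2.0 / epsilon) / (2.0 * np.maximum(visits, 1)))

def in_confidence_set(p, p_hat, radius):
    """True if every entry of p is within the radius of the estimate."""
    return bool(np.all(np.abs(p - p_hat) <= radius[..., None]))

# Toy arm with 2 states and 2 actions; the pair (s=1, a=1) was never visited.
counts = np.array([[[8, 2], [5, 5]],
                   [[1, 9], [0, 0]]])
p_hat, visits = empirical_transitions(counts)
radius = confidence_radius(visits, epsilon=0.05)

p_near = np.array([[[0.7, 0.3], [0.5, 0.5]],
                   [[0.2, 0.8], [0.5, 0.5]]])
p_far = np.array([[[0.1, 0.9], [0.5, 0.5]],
                  [[0.2, 0.8], [0.5, 0.5]]])
near_ok = in_confidence_set(p_near, p_hat, radius)
far_ok = in_confidence_set(p_far, p_hat, radius)
```

In this sketch an unvisited state-action pair gets a radius larger than 1, so it places no restriction on the candidate transition function until data arrives.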

Figures (7)

  • Figure 1: The workflow of UCMD-ARMAB and its regret analysis. The dashed arrows show the intended procedure for solving the original problem, and the solid arrows show the actual procedure of UCMD-ARMAB. By relaxing the "hard" activation constraint, UCMD-ARMAB updates the occupancy measure via OMD, combined with the adversarial reward estimator. It then constructs the RMI index policy. These steps correspond to the three sources of learning regret, each bounded by its own lemma: regret due to (i) OMD online optimization, (ii) the bandit-feedback adversarial reward, and (iii) the RMI index policy.
  • Figure 2: Accumulated Reward.
  • Figure 3: Accumulated Regret.
  • Figure 5: Accumulated reward.
  • Figure 6: Accumulated regret.
  • ...and 2 more figures

Theorems & Definitions (32)

  • Remark 3.1
  • Lemma 4.1
  • Remark 4.2
  • Remark 4.3
  • Theorem 4.4
  • Theorem 5.1
  • Lemma 5.2
  • Lemma 5.3
  • Lemma 5.4
  • Lemma 5.5
  • ...and 22 more