Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback

Guojun Xiong; Jian Li

TL;DR

This work tackles adversarial RMAB with unknown transition dynamics and bandit feedback, introducing UCMD-ARMAB to learn under an instantaneous activation constraint. It combines confidence-set based transition estimation, a time-dependent occupancy-measure formulation solved by Online Mirror Descent, a biased adversarial reward estimator for bandit feedback, and a low-complexity Reward Maximizing Index policy to enforce feasibility. The authors prove a regret bound of $\tilde{\mathcal{O}}(H\sqrt{T})$ (up to problem constants) and demonstrate empirical gains over a stochastic baseline in realistic RMAB-like tasks, highlighting practical viability in non-stationary, adversarial environments. The results advance the theoretical and practical understanding of learning in adversarial RMAB with unknown transitions and limited feedback, enabling provably efficient decision-making under tight instantaneous activation constraints.
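As a rough illustration of the Online Mirror Descent component, the sketch below runs one entropic mirror-descent (exponentiated-gradient) step over a probability simplex. It is a simplification, not the paper's update: UCMD-ARMAB applies OMD to time-dependent occupancy measures subject to the confidence sets and the relaxed activation constraint, none of which is reproduced here. The function name `omd_step`, the step size `eta`, and the toy reward vector are assumptions made for illustration.

```python
import numpy as np

def omd_step(q, reward_estimate, eta):
    """One entropic OMD step for reward maximization on the simplex.

    Solves argmax_p  eta * <p, reward_estimate> - KL(p || q) over the
    probability simplex; the closed form is a multiplicative update.
    """
    q = np.asarray(q, dtype=float)
    r = np.asarray(reward_estimate, dtype=float)
    # Subtracting the max leaves the normalized result unchanged
    # and avoids overflow in the exponential.
    w = q * np.exp(eta * (r - r.max()))
    return w / w.sum()

# Toy usage: start from the uniform distribution over three choices.
q0 = np.full(3, 1.0 / 3.0)
q1 = omd_step(q0, reward_estimate=[0.0, 1.0, 0.5], eta=0.5)
```

After the step, mass shifts toward the entries with larger estimated reward while every entry stays strictly positive, which is the property that lets later importance-weighted estimates remain well defined.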

Abstract

Restless multi-armed bandits (RMAB) play a central role in modeling sequential decision-making problems under an instantaneous activation constraint: at most $B$ arms can be activated at any decision epoch. Each restless arm is endowed with a state that evolves independently according to a Markov decision process, whether or not the arm is activated. In this paper, we consider the task of learning in episodic RMAB with unknown transition functions and adversarial rewards, which can change arbitrarily across episodes. Further, we consider a challenging but natural bandit feedback setting in which only the adversarial rewards of activated arms are revealed to the decision maker (DM). The goal of the DM is to maximize its total adversarial reward during the learning process while satisfying the instantaneous activation constraint in each decision epoch. We develop a novel reinforcement learning algorithm with two key components: a novel biased adversarial reward estimator to deal with bandit feedback and unknown transitions, and a low-complexity index policy to satisfy the instantaneous activation constraint. We show an $\tilde{\mathcal{O}}(H\sqrt{T})$ regret bound for our algorithm, where $T$ is the number of episodes and $H$ is the episode length. To the best of our knowledge, this is the first algorithm to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret for adversarial RMAB in this challenging setting.
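To make the two components named in the abstract more concrete, here is a minimal sketch under stated assumptions. The first function enforces the instantaneous activation constraint by activating the `budget` arms with the largest index values; how the paper's index is computed is not given in this summary, so the index values are placeholders. The second function is a generic implicit-exploration-style estimator (importance weighting with an extra `gamma` in the denominator), swapped in as a stand-in for the paper's biased estimator, whose exact form is not reproduced here. All names and numbers are hypothetical.

```python
import numpy as np

def select_top_b(index_values, budget):
    """Return the ids of the `budget` arms with the largest index values."""
    index_values = np.asarray(index_values, dtype=float)
    # Stable sort on the negated values: descending order, ties go to the lower arm id.
    order = np.argsort(-index_values, kind="stable")
    return np.sort(order[:budget])

def biased_reward_estimate(observed_reward, activated, activation_prob, gamma):
    """Importance-weighted reward estimate with `gamma` added to the denominator.

    Rewards of non-activated arms are never observed, so they contribute 0.
    With nonnegative rewards and gamma > 0 the estimate is biased downward
    in expectation, in exchange for a bounded magnitude.
    """
    observed_reward = np.asarray(observed_reward, dtype=float)
    activated = np.asarray(activated, dtype=bool)
    activation_prob = np.asarray(activation_prob, dtype=float)
    return np.where(activated, observed_reward, 0.0) / (activation_prob + gamma)

# Toy usage with N = 4 arms and budget B = 2 (placeholder numbers).
index_values = [0.2, 0.9, 0.4, 0.7]
chosen = select_top_b(index_values, budget=2)
activated = np.isin(np.arange(4), chosen)
r_hat = biased_reward_estimate(
    observed_reward=[0.0, 1.0, 0.0, 0.5],
    activated=activated,
    activation_prob=[0.25, 0.5, 0.25, 0.5],
    gamma=0.1,
)
```

Selecting by sorted index never activates more than `budget` arms, so the constraint holds by construction in every decision epoch rather than only on average.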

Paper Structure (24 sections, 20 theorems, 70 equations, 7 figures, 1 table, 2 algorithms)

Introduction
Related Work
Model and Problem Formulation
ARMAB: Adversarial RMAB
Online Setting and Learning Regret
RL Algorithm for ARMAB
Confidence Sets
Solving Relaxed ARMAB with OMD
Adversarial Reward Estimators
Index Policy for ARMAB
Analysis
Main Results
Proof Sketch
Numerical Study
Details and Proofs in Section \ref{sec:learning}
...and 9 more sections

Key Result

Lemma 4.1

With probability at least $1-2\epsilon$, the true transition functions are within the confidence sets, i.e., $P_n\in {\mathcal{P}}_n^{t}$, $\forall n\in[N], t\in[T].$
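The following sketch shows one common way such confidence sets are built for a single arm: empirical transition frequencies plus a count-dependent radius. The Hoeffding-style radius and its constants are an assumption chosen for illustration; the paper's exact radius is not stated in this summary. The helper names and the parameter `num_episodes` (standing in for $T$) are hypothetical.

```python
import numpy as np

def empirical_transitions(counts):
    """Map counts[s, a, s'] to empirical P_hat[s, a, s'] and visit counts n[s, a]."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=-1)
    # Unvisited (s, a) pairs keep an all-zero row; their radius below is large.
    p_hat = counts / np.maximum(n, 1.0)[..., None]
    return p_hat, n

def confidence_radius(n, num_states, num_actions, num_arms, num_episodes, eps):
    """Hoeffding-style radius (assumed form, not the paper's constants)."""
    log_term = np.log(4.0 * num_states * num_actions * num_arms * num_episodes / eps)
    return np.sqrt(log_term / (2.0 * np.maximum(n, 1.0)))

def in_confidence_set(p, p_hat, radius):
    """Check |p - p_hat| <= radius entrywise for every (s, a, s')."""
    return bool(np.all(np.abs(p - p_hat) <= radius[..., None]))

# Toy usage: one arm with 2 states and 2 actions.
counts = np.zeros((2, 2, 2))
counts[0, 0] = [3, 1]
counts[0, 1] = [0, 4]
counts[1, 0] = [2, 2]
# (state 1, action 1) is left unvisited on purpose.
p_hat, n = empirical_transitions(counts)
radius = confidence_radius(n, num_states=2, num_actions=2, num_arms=1,
                           num_episodes=100, eps=0.05)
```

The radius shrinks as the visit count grows, so the set tightens around the empirical estimate for well-explored state-action pairs and stays wide for unvisited ones.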

Figures (7)

Figure 1: The workflow of UCMD-ARMAB and its regret analysis. The dashed arrows represent the intended procedure for solving the original problem in \ref{eq:orginal_P}, and the solid arrows show the actual procedure of UCMD-ARMAB. By relaxing the "hard" activation constraint as shown in \ref{eq:relaxed_constraint}, UCMD-ARMAB updates the occupancy measure via OMD as in \ref{eq:UCB_extended} (see Section \ref{sec:OMD}), combined with the adversarial reward estimator in \ref{eq:reward_estimator} (see Section \ref{sec:estimator}). Then, it establishes the RMI index policy in \ref{eq:importance-index} (see Section \ref{sec:index}). These correspond to the three sources of learning regret, i.e., regret due to (i) OMD online optimization (Lemma \ref{lem:term1}), (ii) the bandit-feedback adversarial reward (Lemma \ref{lem:term2}), and (iii) the RMI index policy (Lemma \ref{lem:term3}).
Figure 2: Accumulated Reward.
Figure 3: Accumulated Regret.
Figure 5: Accumulated reward.
Figure 6: Accumulated regret.
...and 2 more figures

Theorems & Definitions (32)

Remark 3.1
Lemma 4.1
Remark 4.2
Remark 4.3
Theorem 4.4
Theorem 5.1
Lemma 5.2
Lemma 5.3
Lemma 5.4
Lemma 5.5
...and 22 more
