Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels

Md Kamran Chowdhury Shisher, Vishrant Tripathi, Mung Chiang, Christopher G. Brinton

TL;DR

This work tackles online resource allocation in restless multi-armed bandits with unknown and time-varying transition dynamics. It develops SW-Whittle, a sliding-window online policy that combines Lagrangian relaxation, optimistic transition estimation, and a Bandit-over-Bandit approach to handle unknown variation budgets. The method provides a dynamic regret guarantee of $\tilde{O}(T^{2/3}\tilde{V}^{1/3} + T^{4/5})$ for large RMABs and demonstrates superior performance over baselines in non-stationary environments. This advances practical RMAB deployment by enabling adaptive Whittle-index-based decisions under non-stationarity with theoretical regret bounds and empirical validation.

Abstract

We study optimal resource allocation in restless multi-armed bandits (RMABs) under unknown and non-stationary dynamics. Solving RMABs optimally is PSPACE-hard even with full knowledge of model parameters, and while the Whittle index policy offers asymptotic optimality with low computational cost, it requires access to stationary transition kernels, an unrealistic assumption in many applications. To address this challenge, we propose a Sliding-Window Online Whittle (SW-Whittle) policy that remains computationally efficient while adapting to time-varying kernels. Our algorithm achieves a dynamic regret of $\tilde O(T^{2/3}\tilde V^{1/3}+T^{4/5})$ for large RMABs, where $T$ is the number of episodes and $\tilde V$ is the total variation distance between consecutive transition kernels. Importantly, we handle the challenging case where the variation budget is unknown in advance by combining a Bandit-over-Bandit framework with our sliding-window design. Window lengths are tuned online as a function of the estimated variation, while Whittle indices are computed via an upper confidence bound on the estimated transition kernels and a bilinear optimization routine. Numerical experiments demonstrate that our algorithm consistently outperforms baselines, achieving the lowest cumulative regret across a range of non-stationary environments.
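
As a rough reading of the regret bound, ignoring the logarithmic factors hidden by $\tilde O$, comparing its two terms shows when each one dominates:

$$
T^{2/3}\tilde V^{1/3} \ge T^{4/5}
\;\iff\; \tilde V^{1/3} \ge T^{4/5-2/3} = T^{2/15}
\;\iff\; \tilde V \ge T^{2/5}.
$$

Under this reading, the variation-dependent term drives the bound only when $\tilde V$ grows at least as fast as $T^{2/5}$; for smaller variation the $T^{4/5}$ term dominates. Up to the same logarithmic factors, the bound remains sublinear in $T$ as long as $\tilde V$ grows more slowly than $T$.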

Paper Structure

This paper contains 19 sections, 4 theorems, 47 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Lemma 1

Given $\eta\geq 0$, the probability that the true kernel $P_{n,t}$ lies within the high-dimensional ball $B_{t}^{(n)}$ (described by Eq. (UCB)) is at least $1-\eta$, i.e., $\mathrm{Pr}( P_{n,t} \in B_{t}^{(n)}, \forall n, \forall t) \geq 1-\eta$.
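
To make the idea of a confidence ball around a windowed estimate concrete, here is a minimal Python sketch for a single arm. It is not the paper's construction: the class name `SlidingWindowKernel`, its methods, and the radius formula are hypothetical. The radius is a generic $L_1$ concentration-style quantity of order $\sqrt{S\log(1/\eta)/n}$, used here as an assumption. The exact ball in Eq. (UCB), including any term that accounts for drift inside the window, is not reproduced in this summary.

```python
from collections import deque
import math


class SlidingWindowKernel:
    """Windowed empirical transition estimate for one arm (illustrative only)."""

    def __init__(self, n_states, n_actions, window):
        self.n_states = n_states
        self.n_actions = n_actions
        # deque(maxlen=...) silently drops the oldest transition once full,
        # so only the most recent `window` observations are ever used.
        self.buffer = deque(maxlen=window)

    def observe(self, s, a, s_next):
        """Record one observed transition (s, a) -> s_next."""
        self.buffer.append((s, a, s_next))

    def estimate(self, s, a):
        """Return (empirical distribution over next states, sample count)."""
        counts = [0] * self.n_states
        for si, ai, sn in self.buffer:
            if si == s and ai == a:
                counts[sn] += 1
        n = sum(counts)
        if n == 0:
            # No data in the window: fall back to a uniform guess.
            return [1.0 / self.n_states] * self.n_states, 0
        return [c / n for c in counts], n

    def radius(self, s, a, eta):
        """Assumed L1 confidence radius; NOT the paper's Eq. (UCB)."""
        _, n = self.estimate(s, a)
        return math.sqrt(2.0 * self.n_states * math.log(2.0 / eta) / max(1, n))


# Usage: four transitions from (s=0, a=1), but the window keeps only the last three.
est = SlidingWindowKernel(n_states=2, n_actions=2, window=3)
for s_next in (0, 0, 1, 1):
    est.observe(0, 1, s_next)

p_hat, n = est.estimate(0, 1)   # estimate built from next states (0, 1, 1)
r = est.radius(0, 1, eta=0.05)  # shrinks as more samples fall in the window
```

A shorter window discards stale transitions faster but leaves fewer samples per state-action pair, which widens the radius. That trade-off is what motivates tuning the window length online.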

Figures (1)

  • Figure 1: $\mathrm{Reg}(T)$ vs. number of episodes in Scheduling and 1-D Bandit with $N=20$, $M=4$.

Theorems & Definitions (7)

  • Definition 1: Indexability
  • Definition 2: Whittle Index
  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Remark 1