Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels
Md Kamran Chowdhury Shisher, Vishrant Tripathi, Mung Chiang, Christopher G. Brinton
TL;DR
This work tackles online resource allocation in restless multi-armed bandits (RMABs) with unknown and time-varying transition dynamics. It develops SW-Whittle, a sliding-window online policy that combines Lagrangian relaxation, optimistic transition estimation, and a Bandit-over-Bandit approach to handle unknown variation budgets. The method provides a dynamic regret guarantee of $\tilde{O}(T^{2/3}\tilde{V}^{1/3} + T^{4/5})$ for large RMABs and achieves lower cumulative regret than baselines in non-stationary environments. This advances practical RMAB deployment by enabling adaptive Whittle-index-based decisions under non-stationarity, backed by theoretical regret bounds and empirical validation.
Abstract
We study optimal resource allocation in restless multi-armed bandits (RMABs) under unknown and non-stationary dynamics. Solving RMABs optimally is PSPACE-hard even with full knowledge of model parameters, and while the Whittle index policy offers asymptotic optimality with low computational cost, it requires access to stationary transition kernels, an unrealistic assumption in many applications. To address this challenge, we propose a Sliding-Window Online Whittle (SW-Whittle) policy that remains computationally efficient while adapting to time-varying kernels. Our algorithm achieves a dynamic regret of $\tilde O(T^{2/3}\tilde V^{1/3}+T^{4/5})$ for large RMABs, where $T$ is the number of episodes and $\tilde V$ is the variation budget, measured by the total variation distance between consecutive transition kernels. Importantly, we handle the challenging case where the variation budget is unknown in advance by combining a Bandit-over-Bandit framework with our sliding-window design. Window lengths are tuned online as a function of the estimated variation, while Whittle indices are computed via an upper confidence bound on the estimated transition kernels and a bilinear optimization routine. Numerical experiments demonstrate that our algorithm consistently outperforms baselines, achieving the lowest cumulative regret across a range of non-stationary environments.
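To make the sliding-window estimation step concrete, the following is a minimal sketch of how a per-arm transition kernel could be estimated from only the most recent episodes, together with a confidence radius around the estimate. It is an illustration under stated assumptions, not the SW-Whittle implementation: the class name `SlidingWindowKernelEstimator`, the fixed window length, and the Weissman-style $L_1$ radius are all hypothetical choices, and the Whittle index computation and the Bandit-over-Bandit window tuning are omitted.

```python
from collections import deque

import numpy as np


class SlidingWindowKernelEstimator:
    """Hypothetical helper: estimates one arm's transition kernel
    from the last `window` episodes only."""

    def __init__(self, n_states, n_actions, window):
        self.n_states = n_states
        self.n_actions = n_actions
        # deque(maxlen=...) silently drops the oldest episode when full.
        self.episodes = deque(maxlen=window)

    def add_episode(self, transitions):
        """Record one episode given as (state, action, next_state) triples."""
        counts = np.zeros((self.n_states, self.n_actions, self.n_states))
        for s, a, s_next in transitions:
            counts[s, a, s_next] += 1
        self.episodes.append(counts)

    def estimate(self, delta=0.05):
        """Return (p_hat, radius) computed over the current window.

        p_hat[s, a] is the empirical next-state distribution.
        radius[s, a] is an assumed Weissman-style L1 confidence radius,
        not necessarily the constant used in the paper.
        """
        shape = (self.n_states, self.n_actions, self.n_states)
        counts = sum(self.episodes, np.zeros(shape))
        n = counts.sum(axis=2)                 # visits to each (s, a)
        safe_n = np.maximum(n, 1)              # avoid division by zero
        p_hat = counts / safe_n[..., None]
        p_hat[n == 0] = 1.0 / self.n_states    # uniform prior if unvisited
        radius = np.sqrt(
            2 * (self.n_states * np.log(2) + np.log(1 / delta)) / safe_n
        )
        return p_hat, radius


est = SlidingWindowKernelEstimator(n_states=2, n_actions=1, window=2)
est.add_episode([(0, 0, 1)])   # falls out of the window below
est.add_episode([(0, 0, 0)])
est.add_episode([(0, 0, 0)])
p_hat, radius = est.estimate()
```

A shorter window discards stale data faster, which helps when the kernels drift quickly, at the cost of fewer samples and a wider radius. An optimistic policy would then pick, within the $L_1$ ball of the returned radius around `p_hat`, the kernel that is most favorable for the index computation.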
