Tabular and Deep Learning for the Whittle Index

Francisco Robledo Relaño, Vivek Borkar, Urtzi Ayesta, Konstantin Avrachenkov

TL;DR

This work addresses learning Whittle indices for Restless Multi-Armed Bandit Problems (RMABPs) under a discounted reward criterion. It introduces two online algorithms built on two-time-scale stochastic approximation: a tabular method (QWI) and a neural-network-based extension (QWINN). QWI provably converges to the true Whittle indices for indexable RMABPs. QWINN generalizes to large state spaces using neural networks and is shown, under a contraction assumption, to converge locally to a neighborhood of a local minimum of the Bellman error. Empirical results show that QWI and QWINN converge faster than standard Q-learning, DQN, and NeurWIN across restart, deadline scheduling, and circular problems, with QWINN excelling in large-scale or heterogeneous settings due to its extrapolative capability. The methods offer practical, scalable means to implement near-optimal Whittle-index policies in real-time RMABP applications, whereas NeurWIN occasionally yields suboptimal index orderings in complex scenarios.

Abstract

The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as Restless Multi-Armed Bandit Problems (RMABPs). In this paper we present QWI and QWINN, two reinforcement learning algorithms, respectively tabular and deep, to learn the Whittle index for the total discounted criterion. The key feature is the use of two time-scales, a faster one to update the state-action Q-values, and a relatively slower one to update the Whittle indices. In our main theoretical result we show that QWI, which is a tabular implementation, converges to the real Whittle indices. We then present QWINN, an adaptation of the QWI algorithm using neural networks to compute the Q-values on the faster time-scale, which is able to extrapolate information from one state to another and scales naturally to large state-space environments. For QWINN, we show that all local minima of the Bellman error are locally stable equilibria, which is the first result of its kind for DQN-based schemes. Numerical computations show that QWI and QWINN converge faster than the standard Q-learning algorithm, neural-network-based approximate Q-learning, and other state-of-the-art algorithms.
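The two-time-scale idea described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the toy arm (`toy_arm_step`), the step-size schedules (`step_sizes`), the uniform exploration, and the subsidy-style update form are all assumptions made for illustration. The sketch keeps one Q-table per reference state, updates the Q-values with the faster step `alpha`, and nudges each index estimate with the slower step `beta` toward the point where the active and passive actions are equally attractive in that reference state.

```python
import numpy as np


def step_sizes(n):
    """Hypothetical schedules: alpha is the faster step, beta the slower one."""
    alpha = 1.0 / (n ** 0.6)
    beta = 1.0 / n
    return alpha, beta


def toy_arm_step(state, action, rng, n_states):
    """Hypothetical restless arm (an assumption, not a model from the paper).

    Active (1): reset to state 0 with zero reward.
    Passive (0): earn a state-dependent reward and drift upward with prob. 0.7.
    """
    if action == 1:
        return 0, 0.0
    reward = 0.9 ** (state + 1)
    next_state = min(state + 1, n_states - 1) if rng.random() < 0.7 else state
    return next_state, reward


def qwi_sketch(n_states=4, gamma=0.9, n_iters=20000, seed=0):
    rng = np.random.default_rng(seed)
    # q[k, s, a]: Q-value of (s, a) when the passive subsidy is lam[k],
    # the current index estimate for reference state k.
    q = np.zeros((n_states, n_states, 2))
    lam = np.zeros(n_states)
    ref = np.arange(n_states)
    state = 0
    for n in range(1, n_iters + 1):
        alpha, beta = step_sizes(n)
        action = int(rng.integers(2))  # uniform exploration (assumption)
        next_state, reward = toy_arm_step(state, action, rng, n_states)

        # Faster time-scale: Q-learning step for every reference state at once.
        subsidy = (1 - action) * lam  # subsidy is paid only when passive
        target = reward + subsidy + gamma * q[:, next_state, :].max(axis=1)
        q[:, state, action] += alpha * (target - q[:, state, action])

        # Slower time-scale: move lam[k] to balance active vs. passive in state k.
        lam += beta * (q[ref, ref, 1] - q[ref, ref, 0])

        state = next_state
    return q, lam


if __name__ == "__main__":
    _, lam = qwi_sketch()
    print("index estimates per state:", lam)
```

Because `beta / alpha` tends to zero, the Q-values see the index estimates as almost constant, while the index update sees Q-values that have nearly settled for the current subsidy; this separation is what the two time-scales are meant to provide.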

Paper Structure

This paper contains 20 sections, 2 theorems, 35 equations, 10 figures, 2 algorithms.

Key Result

Theorem 1

(Convergence of QWI) Given learning parameters $\alpha(n)$ and $\beta(n)$ such that $\sum_n \alpha(n) = \sum_n \beta(n) = \infty$, $\sum_n \alpha(n)^2 < \infty$, $\sum_n \beta(n)^2 < \infty$, and $\beta(n) = o(\alpha(n))$, and given that the problem satisfies the indexability condition, iterations (eq_q-learnin
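As a worked illustration of the step-size conditions in Theorem 1, polynomial schedules with different decay rates satisfy all of them. The specific exponents below are assumptions chosen for the example, not values taken from the paper:

$$
\alpha(n) = n^{-0.6}, \qquad \beta(n) = n^{-1}
$$

Both $\sum_n \alpha(n)$ and $\sum_n \beta(n)$ diverge because the exponents are at most 1. The sums of squares, $\sum_n n^{-1.2}$ and $\sum_n n^{-2}$, are finite because the doubled exponents exceed 1. Finally, $\beta(n)/\alpha(n) = n^{-0.4} \to 0$, so the index update driven by $\beta(n)$ runs on the slower time-scale.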

Figures (10)

  • Figure 1: Histogram of eigenvalue moduli of $-(\nabla_1^2 \mathcal{E}(\theta^*, \theta^*))^{-1} \nabla_2\nabla_1 \mathcal{E}(\theta^*, \theta^*)$
  • Figure 2: Performance graphs for the "homogeneous restart" problem. (left) Bellman Relative Error $BRE(\pi_n^P)$, $P\in \{\textrm{QWI}, \textrm{QWINN}, \textrm{NeurWIN}, \textrm{DQN}, \textrm{Q-learning} \}$ during training for the "homogeneous restart" problem, $N=5, M=1, |S|=5$; and (right) Percentage of states in which an optimal action is not performed in the "restart" problem for homogeneous arms
  • Figure 3: Evolution of the Whittle index estimates for the restart problem
  • Figure 4: Performance graphs for the QWI, QWINN, NeurWIN, DQN and Q-learning algorithms: assignment of optimal policies in the heterogeneous "restart" problem (left) and discounted rewards for the homogeneous "deadline scheduling" problem (right).
  • Figure 5: Performance graphs for "deadline scheduling problem" using $N=5, M=2, |S|=130$.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2