Lagrangian Index Policy for Restless Bandits with Average Reward

Konstantin Avrachenkov, Vivek S. Borkar, Pratik Shah

TL;DR

The article studies restless bandits under a long-run average reward criterion and proposes the Lagrangian Index Policy (LIP) as a computationally light alternative to the Whittle Index Policy (WIP). Starting from a Lagrangian relaxation, LIP computes a per-state index $\gamma^i(x)=Q^i(x,1)-Q^i(x,0)$ for each arm $i$ and supports online learning through tabular Q-learning and Deep Q-Networks. The authors derive an analytic Lagrangian index for the restart model, show that LIP performs competitively with WIP on restart and deadline-scheduling benchmarks, and prove asymptotic optimality for homogeneous arms using exchangeability and de Finetti's theorem under a Global Attractor Hypothesis. In the reported experiments LIP performs close to WIP where WIP does well and continues to perform well where WIP does not, including a non-Whittle-indexable problem, while its learning schemes require less memory than the analogous WIP schemes.
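To make the index formula concrete, here is a minimal sketch of how $\gamma^i(x)=Q^i(x,1)-Q^i(x,0)$ could drive arm selection. It assumes that the policy ranks arms by their current index and activates the top `budget` of them; the function names, the `budget` parameter, and the Q-table values are all hypothetical and chosen only for illustration.

```python
import numpy as np

def lagrangian_indices(q_tables, states):
    """Return gamma^i(x) = Q^i(x, 1) - Q^i(x, 0) for each arm i at its current state x."""
    return np.array([q[x, 1] - q[x, 0] for q, x in zip(q_tables, states)])

def select_arms(q_tables, states, budget):
    """Activate the `budget` arms with the largest index (assumed ranking rule)."""
    gamma = lagrangian_indices(q_tables, states)
    # Stable sort on the negated index: larger index first, ties broken by arm order.
    order = np.argsort(-gamma, kind="stable")
    active = np.zeros(len(states), dtype=int)
    active[order[:budget]] = 1
    return active

# Hypothetical Q-tables: 3 arms, 2 states each, columns = (passive, active).
q_tables = [
    np.array([[1.0, 1.5], [0.2, 0.1]]),
    np.array([[0.0, 2.0], [1.0, 1.0]]),
    np.array([[0.5, 0.4], [0.3, 0.9]]),
]
states = [0, 0, 1]  # current state of each arm
actions = select_arms(q_tables, states, budget=2)
```

With these made-up values the indices are 0.5, 2.0, and 0.6, so the second and third arms are activated.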

Abstract

We study the Lagrangian Index Policy (LIP) for restless multi-armed bandits with long-run average reward. In particular, we compare the performance of LIP with the performance of the Whittle Index Policy (WIP), both heuristic policies known to be asymptotically optimal under certain natural conditions. Even though in most cases their performances are very similar, in the cases when WIP shows bad performance, LIP continues to perform very well. We then propose reinforcement learning algorithms, both tabular and NN-based, to obtain online learning schemes for LIP in the model-free setting. The proposed reinforcement learning schemes for LIP require significantly less memory than the analogous schemes for WIP. We calculate analytically the Lagrangian index for the restart model, which applies to the optimal web crawling and the minimization of the weighted age of information. We also give a new proof of asymptotic optimality in case of homogeneous arms as the number of arms goes to infinity, based on exchangeability and de Finetti's theorem.
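The tabular learning idea can be sketched for a single arm as follows. This is not the paper's algorithm: it assumes a fixed multiplier `lam` paid as a subsidy for the passive action, a relative-value-iteration style update with reference entry `Q[0, 0]`, uniform random exploration, and a simulator built from a known transition array `P` and reward array `R`. All of these names and choices are hypothetical, and the multiplier update mentioned in Theorem 2.2 is omitted.

```python
import numpy as np

def rvi_q_learning(P, R, lam, steps=20000, seed=0):
    """Average-reward Q-learning sketch for one arm with a fixed subsidy lam.

    P[u][x] is the next-state distribution under action u in state x.
    R[x, u] is the reward; the passive action (u = 0) also earns lam (assumed form).
    """
    rng = np.random.default_rng(seed)
    n_states = R.shape[0]
    Q = np.zeros((n_states, 2))
    visits = np.zeros((n_states, 2))
    x = 0
    for _ in range(steps):
        u = int(rng.integers(2))                  # uniform exploration
        y = int(rng.choice(n_states, p=P[u][x]))  # simulated transition
        reward = R[x, u] + lam * (1 - u)          # subsidy for staying passive
        visits[x, u] += 1
        step = 1.0 / visits[x, u]                 # decreasing step size
        # Subtracting the reference entry Q[0, 0] keeps the iterates bounded.
        target = reward + Q[y].max() - Q[0, 0]
        Q[x, u] += step * (target - Q[x, u])
        x = y
    return Q

# Hypothetical 2-state arm.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # passive
    [[0.5, 0.5], [0.7, 0.3]],  # active
])
R = np.array([[0.0, 1.0], [0.0, 2.0]])
Q = rvi_q_learning(P, R, lam=0.5)
index = Q[:, 1] - Q[:, 0]  # per-state index estimate from the learned table
```

Only one Q-table per arm is stored here, which is the kind of structure the index formula needs.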

Paper Structure

This paper contains 17 sections, 8 theorems, 73 equations, 6 figures, 4 algorithms.

Key Result

Theorem 2.2

Under the step-size conditions (timesteps), the iterates (QupdateA1) and (lambdaupdateA1) converge a.s. as $n \rightarrow \infty$.

Figures (6)

  • Figure 1: Restart Model. Subsidy (Lagrange multiplier).
  • Figure 2: Restart Model. LIP and WIP average reward comparison.
  • Figure 3: Non-Whittle-indexable problem. Algorithm (tab1).
  • Figure 4: Non-Whittle-indexable problem. Algorithm (tab2).
  • Figure 5: Deadline scheduling problem (homogeneous arms). Algorithm (tab:DQN).
  • ...and 1 more figure

Theorems & Definitions (16)

  • Remark 2.1
  • Theorem 2.2
  • proof
  • Theorem 2.3
  • proof
  • Theorem 2.4
  • Theorem 4.3
  • Remark 4.4
  • Theorem 4.5
  • Theorem 4.6
  • ...and 6 more