Deep Index Policy for Multi-Resource Restless Matching Bandit and Its Application in Multi-Channel Scheduling

Nida Zamir, I-Hong Hou

TL;DR

This work addresses multi-resource scheduling under unknown and heterogeneous dynamics by formulating a multi-resource restless matching bandit (MR-RMB) model and introducing the Deep Index Policy (DIP). DIP learns partial indexes $w_{n,h}$ via a policy gradient theorem tailored for auxiliary MDPs, using an actor-critic architecture with multiple resources per arm and a Max-Weight Index Matching backbone to allocate resources under capacity constraints. The authors derive a policy gradient expression for index learning, enable online adaptation without prior kernel knowledge, and demonstrate DIP's effectiveness across AoI minimization, holding-cost minimization, and online advertisement placement, where it outperforms DeepTOP and Whittle-index baselines, especially in heterogeneous settings. The results highlight DIP’s versatility and potential for broad applications beyond wireless scheduling, and point to future work on extending MR-RMB to multi-objective optimization scenarios.

Abstract

Scheduling in multi-channel wireless communication systems presents formidable challenges in allocating resources effectively. To address these challenges, we investigate a multi-resource restless matching bandit (MR-RMB) model for heterogeneous resource systems, with the objective of maximizing the long-term discounted total reward while respecting resource constraints. The model also generalizes to applications beyond multi-channel wireless systems. We discuss the Max-Weight Index Matching algorithm, which allocates resources based on learned partial indexes, and we derive a policy gradient theorem for index learning. Our main contribution is the Deep Index Policy (DIP), an online learning algorithm tailored to MR-RMB. DIP learns the partial indexes by leveraging the policy gradient theorem for restless arms with convoluted and unknown transition kernels over heterogeneous resources. We demonstrate the utility of DIP by evaluating its performance on three different MR-RMB problems. Our simulation results show that DIP learns the partial indexes efficiently.
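To make the matching step concrete, the following is a minimal sketch of a max-weight assignment over a table of partial indexes. It is not the authors' implementation: it assumes each arm receives at most one resource, each resource has an integer capacity, and an arm may be left unserved at zero weight. The function name `max_weight_index_matching` and its arguments are hypothetical, and the exhaustive search stands in for whatever matching solver the paper uses, so it is only suitable for tiny instances.

```python
from itertools import product


def max_weight_index_matching(indexes, capacity):
    """Pick the capacity-feasible assignment with the largest total index.

    indexes[n][h] is the (assumed already learned) partial index of arm n
    for resource h; capacity[h] is how many arms resource h can serve.
    Returns (assignment, total), where assignment[n] is a resource id or
    None if arm n is left unserved.
    """
    num_resources = len(capacity)
    # Assumption: leaving an arm unserved is always allowed and adds 0.
    choices = [None] + list(range(num_resources))
    best_assignment, best_total = None, float("-inf")
    # Exhaustive search over all assignments (exponential; illustration only).
    for assignment in product(choices, repeat=len(indexes)):
        # Skip assignments that exceed any resource's capacity.
        if any(assignment.count(h) > capacity[h] for h in range(num_resources)):
            continue
        total = sum(indexes[n][h] for n, h in enumerate(assignment) if h is not None)
        if total > best_total:
            best_assignment, best_total = assignment, total
    return list(best_assignment), best_total


# Three arms, two resources with capacity one each (made-up index values).
example_indexes = [[3.0, 1.0], [2.5, 2.0], [0.5, -1.0]]
print(max_weight_index_matching(example_indexes, [1, 1]))
# → ([0, 1, None], 5.0)
```

In this toy instance, arm 0 takes resource 0 and arm 1 takes resource 1, while arm 2 stays unserved because both resources are at capacity. For realistic problem sizes, a polynomial-time assignment solver would replace the exhaustive loop.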

Paper Structure (16 sections, 2 theorems, 26 equations, 6 figures, 3 algorithms)

Key Result

Corollary 5.1

If arm $n$ is indexable, then setting $w^{\phi_h}_{h}(s)$ to be its partial index $w_{h}(s,\vec{\lambda}_{-h})$ maximizes $\mathcal{Q}^{\phi_h}(s,a,\lambda_h)$ for any $\lambda_h$.

Figures (6)

  • Figure 1: An illustration of Corollary 5.1
  • Figure 2: AoI comparison for multi-channel wireless networks with heterogeneous channels
  • Figure 3: AoI comparison for multi-channel wireless networks with homogeneous channels
  • Figure 4: Holding cost comparison for multi-channel wireless networks with heterogeneous channels
  • Figure 5: Holding cost comparison for multi-channel wireless networks with homogeneous channels
  • ...and 1 more figure

Theorems & Definitions (6)

  • Definition 4.1
  • Definition 4.2
  • Corollary 5.1
  • proof
  • Theorem 5.2
  • proof