Multi-Action Restless Bandits with Weakly Coupled Constraints: Simultaneous Learning and Control

Jing Fu, Bill Moran, José Niño-Mora

TL;DR

The work tackles online learning and control of Weakly Coupled Gangs (WCGs) of restless multi-action bandits, where transitions and rewards are initially unknown. It introduces the WCG-StimL framework, pairing a primary control policy with multiple learning processes to estimate $Q$-factors and derive online MP/Whittle-type indices, enabling coordination under multiple weakly coupled constraints. Theoretical results establish convergence in time and exponential convergence in the magnitude dimension, yielding online policies (OMPI and ALP/OALP) that are asymptotically optimal as system size grows, with exponentially diminishing suboptimality $O(e^{-h})$. The framework generalizes to finite-horizon LP-based approximations (ALP, OALP) that maintain optimality guarantees without non-degeneracy assumptions, making it practical for large-scale RMAB-type problems in settings with unknown dynamics.

Abstract

We study a system with finitely many groups of multi-action bandit processes, each of which is a Markov decision process (MDP) with finite state and action spaces and potentially different transition matrices under different actions. Bandit processes in the same group share the same state and action spaces and, for any given action, the same transition matrix. All bandit processes across the groups are subject to multiple weakly coupled constraints over their state and action variables. Unlike past studies, which focused on the offline case, we consider the online case without assuming full knowledge of the transition matrices and reward functions a priori, and propose an effective scheme that enables simultaneous learning and control. We prove the convergence of the relevant processes both over time and in the number of bandit processes, referred to as convergence in the time and magnitude dimensions, respectively. Moreover, we prove that the relevant processes converge exponentially fast in the magnitude dimension, leading to an exponentially diminishing performance deviation between the proposed online algorithms and offline optimality.

Paper Structure

This paper contains 35 sections, 195 equations, and 2 algorithms.