MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment

Mohsen Amiri, Konstantin Avrachenkov, Ibtihal El Mimouni, Sindri Magnússon

TL;DR

MARBLE tackles restless multi-armed bandits under nonstationary regimes by introducing a latent Markovian environment that drives regime switches in arm dynamics, enabling realistic modeling of abrupt changes. The authors define Markov-Averaged Indexability (MAI) and develop synchronous Q-learning with Whittle indices (QWI), proving almost-sure convergence to the environment-averaged optima $\bar{\mathbf{Q}}^{i,z}_{\bar{\lambda}_*}$ and $\bar{\lambda}^i_*(z)$ via two-timescale stochastic approximation. The approach relies on calibrated simulators (digital twins) to perform environment-averaged planning updates without observing the latent state, and it yields per-arm Whittle indices converging to the environment-averaged policy. Simulations on a digital-twin recommender demonstrate robust adaptation to regime changes and convergence to oracle policies, underscoring practical viability for nonstationary RMAB applications.
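The two-timescale idea can be illustrated with a minimal sketch for a single two-state arm. Everything specific here is an assumption rather than the paper's algorithm: the discounted formulation, the polynomial step sizes, the function name `qwi_sketch`, and the simulator that draws the hidden regime from fixed weights `env_probs` are all hypothetical choices made only to show Q-values moving on a fast timescale while the per-state subsidies `lam[z]` move on a slow one.

```python
import random


def qwi_sketch(p_next1, reward, env_probs, gamma=0.9, n_iters=20000, seed=0):
    """Two-timescale Q-learning for Whittle-style subsidies on one two-state arm.

    p_next1[e][a][s]: probability that the next state is 1 under hidden regime e,
                      action a (0 = passive, 1 = active) and current state s.
    reward[s][a]:     immediate reward.
    env_probs[e]:     weights used to draw the hidden regime inside the simulator
                      (assumed here to be its stationary distribution).
    Returns one subsidy estimate lam[z] per reference state z.
    """
    rng = random.Random(seed)
    states, actions = (0, 1), (0, 1)
    q = [[[0.0, 0.0] for _ in states] for _ in states]  # q[z][s][a]
    lam = [0.0 for _ in states]

    def simulate(s, a):
        # Hypothetical environment-averaged simulator: the regime is drawn
        # internally and never revealed to the learner.
        e = rng.choices(range(len(env_probs)), weights=env_probs)[0]
        return 1 if rng.random() < p_next1[e][a][s] else 0

    for n in range(n_iters):
        alpha = 1.0 / (1 + n) ** 0.6  # fast timescale (Q-values)
        beta = 1.0 / (1 + n) ** 0.9   # slow timescale (subsidies); beta/alpha -> 0
        for z in states:
            old = q[z]
            new = [[0.0, 0.0] for _ in states]
            # Synchronous sweep: every (state, action) pair is updated from `old`.
            for s in states:
                for a in actions:
                    s_next = simulate(s, a)
                    # The passive action (a = 0) additionally earns the subsidy lam[z].
                    target = reward[s][a] + (1 - a) * lam[z] + gamma * max(old[s_next])
                    new[s][a] = old[s][a] + alpha * (target - old[s][a])
            q[z] = new
            # Push the subsidy toward indifference between actions at state z.
            lam[z] += beta * (old[z][1] - old[z][0])
    return lam


if __name__ == "__main__":
    # Hypothetical arm with two hidden regimes; all numbers are illustrative only.
    p = [
        [[0.1, 0.3], [0.6, 0.9]],  # regime 0: rows are actions, columns are states
        [[0.4, 0.5], [0.5, 0.6]],  # regime 1
    ]
    r = [[0.0, 0.0], [1.0, 1.0]]   # reward 1 whenever the arm is in state 1
    print(qwi_sketch(p, r, env_probs=[0.75, 0.25]))
```

Two simple sanity checks follow from the update rule: if both actions have identical rewards and dynamics, the subsidies stay at zero, and if the active action only adds a constant reward of 1 with unchanged dynamics, the subsidies approach 1.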

Abstract

Restless Multi-Armed Bandits (RMABs) are powerful models for decision-making under uncertainty, yet classical formulations typically assume fixed dynamics, an assumption often violated in nonstationary environments. We introduce MARBLE (Multi-Armed Restless Bandits in a Latent Markovian Environment), which augments RMABs with a latent Markov state that induces nonstationary behavior. In MARBLE, each arm evolves according to a latent environment state that switches over time, making policy learning substantially more challenging. We further introduce the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability assumption and prove that, despite unobserved regime switches, under the MAI criterion, synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and the corresponding Whittle indices. We validate MARBLE on a calibrated simulator-embedded (digital twin) recommender system, where QWI consistently adapts to a shifting latent state and converges to an optimal policy, empirically corroborating our theoretical findings.
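To make the latent-regime idea concrete, the following sketch simulates one arm whose transition kernel is selected by a hidden two-state Markov chain, and computes a kernel averaged over that chain's stationary distribution. The helper names, the two-regime setup, the fixed action, and the reading of "Markov-averaged" as a stationary-distribution average are assumptions made for illustration; the source does not give the exact construction.

```python
import random


def stationary_two_state(p01, p10):
    """Stationary distribution of a two-state chain with switch probabilities p01, p10."""
    total = p01 + p10
    return [p10 / total, p01 / total]


def averaged_kernel(kernels, weights):
    """Mix per-regime kernels kernels[e][s][s_next] using the given regime weights."""
    n = len(kernels[0])
    return [
        [sum(w * k[s][t] for w, k in zip(weights, kernels)) for t in range(n)]
        for s in range(n)
    ]


def simulate_arm(kernels, p01, p10, steps, seed=0):
    """Roll out one arm (under a fixed action) driven by a hidden two-state regime."""
    rng = random.Random(seed)
    env, state, path = 0, 0, []
    for _ in range(steps):
        row = kernels[env][state]
        state = rng.choices(range(len(row)), weights=row)[0]
        # The regime switches on its own; an observer only ever sees `state`.
        switch = p01 if env == 0 else p10
        if rng.random() < switch:
            env = 1 - env
        path.append(state)
    return path


if __name__ == "__main__":
    # Illustrative numbers only: a "sticky" regime and a "noisy" regime.
    sticky = [[0.9, 0.1], [0.2, 0.8]]
    noisy = [[0.5, 0.5], [0.5, 0.5]]
    pi = stationary_two_state(0.1, 0.3)
    print(pi)
    print(averaged_kernel([sticky, noisy], pi))
    print(simulate_arm([sticky, noisy], 0.1, 0.3, steps=20))
```

Seen from outside, the arm's observed state sequence is nonstationary because the kernel in force changes with the hidden regime, while the averaged kernel is a single fixed object that a planner could work with.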

Paper Structure

This paper contains 11 sections, 2 theorems, 29 equations, 1 figure, 1 algorithm.

Key Result

Theorem 4.1

Assume that the boundedness, step-size (two-timescale), and Markov-Averaged Indexability (MAI) assumptions hold, and that for every arm $i \in \mathcal{N}$ a calibrated simulator $\mathcal{G}_i(\theta)$ with known $\theta$ is available. Then, under the MARBLE model, QWI (Algorithm 1) converges almost …

Figures (1)

  • Figure 1: Performance of synchronous QWI over $500{,}000$ iterations on the push-notification recommender task.

Theorems & Definitions (5)

  • Definition 2.1: Indexability
  • Definition 3.2: MARBLE
  • Theorem 4.1
  • proof
  • Theorem 4.2: Borkar’s Two-timescale SA Theorem (borkar1997stochastic, borkar2008stochastic)