Flickering Multi-Armed Bandits

Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

TL;DR

This work proposes and analyzes a two-phase algorithm for Flickering Multi-Armed Bandits: a lazy random walk explores the arms to identify the optimal one, and a navigation and commitment phase then exploits it. It establishes high-probability and expected sublinear regret bounds under both graph models considered (i.i.d. Erdős--Rényi and Edge-Markovian).

Abstract

We introduce Flickering Multi-Armed Bandits (FMAB), a new MAB framework where the set of available arms (or actions) can change at each round, and the available set at any time may depend on the agent's previously selected arm. We model this constrained, evolving availability using random graph processes, where arms are nodes and the agent's movement is restricted to its local neighborhood. We analyze this problem under two random graph models: an i.i.d. Erdős--Rényi (ER) process and an Edge-Markovian process. We propose and analyze a two-phase algorithm that employs a lazy random walk for exploration to efficiently identify the optimal arm, followed by a navigation and commitment phase for exploitation. We establish high-probability and expected sublinear regret bounds for both graph settings. We show that the exploration cost of our algorithm is near-optimal by establishing a matching information-theoretic lower bound for this problem class, highlighting the fundamental cost of exploration under local-move constraints. We complement our theoretical guarantees with numerical simulations, including a scenario of a robotic ground vehicle scouting a disaster-affected region.
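The two-phase structure described in the abstract can be illustrated with a minimal sketch under the i.i.d. ER model. This is not the authors' implementation. The function names (`sample_er_neighbors`, `two_phase_fmab`), the Bernoulli reward model, the laziness probability of 1/2, the assumption that staying on the current arm is always allowed, and the simple navigation rule (jump to the target when it is adjacent, otherwise move to a uniformly random neighbor) are all assumptions made for illustration.

```python
import random


def sample_er_neighbors(current, n, p, rng):
    """Neighbors of `current` in a fresh G(n, p) draw.

    Only the edges touching `current` matter for the next move, so only
    those n - 1 potential edges are sampled (each present with prob. p).
    """
    return [v for v in range(n) if v != current and rng.random() < p]


def two_phase_fmab(means, p, T, T0, seed=0):
    """Illustrative explore-then-commit loop with local moves.

    means : Bernoulli reward means, one per arm (assumed reward model)
    p     : ER edge probability, redrawn independently every round
    T     : horizon; T0 : number of exploration rounds
    Returns (target_arm, list_of_pulled_arms).
    """
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n
    sums = [0.0] * n
    pos = rng.randrange(n)  # arbitrary starting arm
    target = None
    pulls = []

    for t in range(T):
        nbrs = sample_er_neighbors(pos, n, p, rng)
        if t < T0:
            # Phase 1: lazy random walk. Stay with probability 1/2,
            # otherwise move to a uniformly random available neighbor.
            if nbrs and rng.random() >= 0.5:
                pos = rng.choice(nbrs)
        else:
            if target is None:
                # Fix the empirically best arm among those visited.
                target = max(
                    range(n),
                    key=lambda a: sums[a] / counts[a] if counts[a] else float("-inf"),
                )
            # Phase 2: navigate toward the target, then commit to it.
            if pos != target:
                if target in nbrs:
                    pos = target
                elif nbrs:
                    pos = rng.choice(nbrs)  # assumed fallback rule

        reward = 1.0 if rng.random() < means[pos] else 0.0
        counts[pos] += 1
        sums[pos] += reward
        pulls.append(pos)

    return target, pulls
```

Calling `two_phase_fmab([0.1, 0.2, 0.3, 0.4, 0.9], p=0.5, T=2000, T0=1000)` returns the committed arm and the full pull sequence. Once the walk reaches the target in the second phase, it never leaves, which mirrors the commitment step described above.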

Paper Structure (114 sections, 41 theorems, 270 equations, 4 figures)


Key Result

theorem 1

Consider the FMAB problem under the i.i.d. ER model (homogeneous or heterogeneous). For any failure probability $\delta \in (0,1)$, there exist absolute constants $c_1, c_2, c_3 > 0$ such that by setting the exploration length $T_0 \geq c_1 n \log(nT/\delta) + c_2 n \log(n/\delta)/\Delta_{\min}^2$,
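To get a feel for how the exploration-length condition in the statement above scales, its right-hand side can be evaluated numerically. The sketch below is only an illustration: the theorem asserts that absolute constants $c_1, c_2$ exist but this excerpt does not give their values, so the defaults `c1=1.0` and `c2=1.0` are placeholders, and the function name `exploration_length` is hypothetical.

```python
import math


def exploration_length(n, T, delta, gap_min, c1=1.0, c2=1.0):
    """Smallest integer T0 satisfying
    T0 >= c1 * n * log(n*T/delta) + c2 * n * log(n/delta) / gap_min**2.

    c1 and c2 are placeholder values (assumption); the theorem only
    states that suitable absolute constants exist.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if gap_min <= 0.0:
        raise ValueError("gap_min must be positive")
    first_term = c1 * n * math.log(n * T / delta)
    second_term = c2 * n * math.log(n / delta) / gap_min ** 2
    return math.ceil(first_term + second_term)
```

With the placeholder constants, `exploration_length(10, 1000, 0.1, 0.5)` evaluates to 300. The second term grows as $1/\Delta_{\min}^2$, so halving the minimum gap roughly quadruples that part of the exploration budget.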

Figures (4)

  • Figure 1: A 4-arm FMAB problem at $t=s, s+1, s+2$, with pull sequence $a_s=3, a_{s+1}=4, a_{s+2}=2$ (from $a_{s-1}=1$). (A) Problem view: Blue/White arms are accessible/inaccessible. Pulled arm has a dark blue border. (B) Graph view: Learner starts at the dark blue node ($a_{t-1}$) and can move to blue neighbors ($L_t(a_{t-1})$).
  • Figure 2: Average cumulative regret $R(t)/t$ over time.
  • Figure 3: Spatial visitation density (log-scale) at three representative stages of the mission. The heatmaps illustrate the algorithm's behavior transitioning from broad, near-uniform exploration (Rounds 1–100) to an emerging preference for good sites (Rounds 101–500), and finally to a sharp, focused exploitation of the optimal hotspot (Rounds 501–$T$).
  • Figure 4: (a) Average cumulative regret $R(t)/t$ over time for the i.i.d. ER case. The main plot and the inset (showing the initial $t \le 200$ rounds) demonstrate the algorithm's rapid convergence as it quickly learns to identify high-utility locations. (b) The same plot for the Edge-Markovian case. (c) Box plot of the navigation cost in the ER case as a function of graph sparsity.

Theorems & Definitions (92)

  • theorem 1: Regret for i.i.d. ER Graphs
  • proof
  • remark 1: General Analysis and the Homogeneous Case
  • theorem 2: Regret for Edge-Markovian Graphs
  • proof
  • corollary 1: Expected Regret Bound
  • proof
  • remark 2: On the Near-Optimality of Exploration
  • lemma 1: ER Availability
  • proof
  • ...and 82 more