Detection Augmented Bandit Procedures for Piecewise Stationary MABs: A Modular Approach

Yu-Han Huang; Argyrios Gerogiannis; Subhonmesh Bose; Venugopal V. Veeravalli

TL;DR

This work addresses non-stationary multi-armed bandits (MABs) modeled as piecewise stationary (PS-MAB) environments by proposing a modular Detection Augmented Bandit (DAB) framework that couples a stationary bandit algorithm with a change detector and adds forced exploration. The authors establish new instance-dependent and minimax lower bounds on PS-MAB regret and develop a modular regret analysis that separates the detector and bandit contributions, yielding order-optimal regret bounds for a broad class of detector-bandit combinations. Extensive experiments comparing GLR and GSR detectors paired with UCB, MOSS, and klUCB demonstrate the framework's robustness and flexibility, showing competitive or superior regret and reliable detection performance. Overall, the modular DAB approach offers a plug-and-play methodology for non-stationary bandits with provable guarantees and practical effectiveness; extending it to more general, slowly changing settings is left to future work.

Abstract

Conventional Multi-Armed Bandit (MAB) algorithms are designed for stationary environments, where the reward distributions associated with the arms do not change with time. In many applications, however, the environment is more accurately modeled as being non-stationary. In this work, piecewise stationary MAB (PS-MAB) environments are investigated, in which the reward distributions associated with a subset of the arms change at some change-points and remain stationary between change-points. Our focus is on the asymptotic analysis of PS-MABs, for which practical algorithms based on change detection have been previously proposed. Our goal is to modularize the design and analysis of such Detection Augmented Bandit (DAB) procedures. To this end, we first provide novel, improved performance lower bounds for PS-MABs. Then, we identify the requirements for stationary bandit algorithms and change detectors in a DAB procedure that are needed for the modularization. We assume that the rewards are sub-Gaussian. Under this assumption and a condition on the separation of the change-points, we show that the analysis of DAB procedures can indeed be modularized, so that the regret bounds can be obtained in a unified manner for various combinations of change detectors and bandit algorithms. Through this analysis, we develop new modular DAB procedures that are order-optimal. Finally, we showcase the practical effectiveness of our modular DAB approach in our experiments, studying its regret performance compared to other methods and investigating its detection capabilities.
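The coupling described in the abstract (a stationary bandit algorithm, per-arm change detectors, and forced exploration) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration rather than the authors' procedure: the class and function names, the use of UCB1 as the stationary algorithm, the known-variance Gaussian mean-shift statistic with a fixed threshold standing in for the paper's GLR/GSR tests, the round-robin forced-exploration schedule with rate `alpha`, and the global restart on any alarm.

```python
import math
import random


class UCB1:
    """Stationary UCB1 policy (stand-in for any stationary bandit algorithm)."""

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.reset()

    def reset(self):
        self.counts = [0] * self.n_arms
        self.sums = [0.0] * self.n_arms

    def select(self):
        for a in range(self.n_arms):
            if self.counts[a] == 0:  # pull each arm once before using indices
                return a
        total = sum(self.counts)
        return max(
            range(self.n_arms),
            key=lambda a: self.sums[a] / self.counts[a]
            + math.sqrt(2.0 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


class GaussianGLR:
    """Simplified mean-shift GLR statistic with known sigma (hypothetical threshold)."""

    def __init__(self, sigma=1.0, threshold=10.0):
        self.sigma, self.threshold = sigma, threshold
        self.reset()

    def reset(self):
        self.prefix = [0.0]  # prefix sums of the samples seen since the last reset

    def update(self, x):
        self.prefix.append(self.prefix[-1] + x)
        n, total = len(self.prefix) - 1, self.prefix[-1]
        best = 0.0
        for s in range(1, n):  # scan every candidate change location
            mu1 = self.prefix[s] / s
            mu2 = (total - self.prefix[s]) / (n - s)
            stat = s * (n - s) / n * (mu1 - mu2) ** 2 / (2.0 * self.sigma ** 2)
            best = max(best, stat)
        return best >= self.threshold


def run_dab(means_by_segment, change_points, horizon, alpha=0.05, seed=0):
    """Simulate the sketch on Gaussian rewards; requires alpha > 0."""
    rng = random.Random(seed)
    n_arms = len(means_by_segment[0])
    bandit = UCB1(n_arms)
    detectors = [GaussianGLR() for _ in range(n_arms)]
    period = math.ceil(n_arms / alpha)  # one forced round-robin sweep per period
    last_reset, segment, alarms, regret = 0, 0, [], 0.0
    for t in range(horizon):
        if segment < len(change_points) and t == change_points[segment]:
            segment += 1  # the environment moves to its next stationary segment
        means = means_by_segment[segment]
        slot = (t - last_reset) % period
        arm = slot if slot < n_arms else bandit.select()  # forced exploration first
        reward = rng.gauss(means[arm], 1.0)
        regret += max(means) - means[arm]
        bandit.update(arm, reward)
        if detectors[arm].update(reward):  # alarm: restart bandit and detectors
            alarms.append(t)
            bandit.reset()
            for d in detectors:
                d.reset()
            last_reset = t + 1
    return regret, alarms
```

A call such as `run_dab([[0.9, 0.1], [0.1, 0.9]], [300], 600)` simulates two arms whose means swap at step 300 and returns the accumulated pseudo-regret together with the alarm times. Because the detector only sees samples from arms that are actually pulled, the forced sweeps are what keep a currently suboptimal arm monitored in this sketch.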

Paper Structure (18 sections, 5 theorems, 59 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Lower Bounds on Regret Accumulation for PS-MABs
The DAB Framework
Modular Regret Analysis of DAB Procedures
Requirements for Stationary Bandit Algorithms and Change Detectors
Modularized Regret Analysis
Application to Various Combinations of Change Detectors and Stationary Bandit Algorithms
Experimental Study
Experimental Benchmark
Algorithms and Parameters
Practical Tuning of QCD Tests
Experimental Results
Regret Performance
Detection Performance
On the Necessity of Forced Exploration
...and 3 more sections

Key Result

Theorem 1

For any PS-MAB procedure with sublinear regret, i.e., $R_{T} \leq cT^{p}$ for all $T\in \mathbb{N}$ for some $c >0$ and $p \in \left[ 0, 1\right)$ in any PS-MAB, there exists a PS-MAB instance with at most $N_{T}$ changes and suboptimality gaps greater than $\Delta$ (i.e., $\Delta_{a,k}$ [...]) [...] which implies that $R_{T}=\Omega(AN_T\log(T/N_T))$.
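To get a feel for the scaling in the final expression, the growth term $AN_T\log(T/N_T)$ can be evaluated numerically. The short Python helper below is only an illustration: the function name is hypothetical, the constants hidden by the $\Omega$ notation are omitted, and the number of arms used in the usage line is an assumed value, so the result is a growth term and not a regret bound.

```python
import math


def lower_bound_order(n_arms, n_changes, horizon):
    """Return A * N_T * log(T / N_T), the growth term inside the Omega notation.

    Constants hidden by Omega are omitted, so this is not a bound by itself.
    """
    if n_arms <= 0 or n_changes <= 0 or horizon <= n_changes:
        raise ValueError("need n_arms > 0 and 0 < n_changes < horizon")
    return n_arms * n_changes * math.log(horizon / n_changes)


if __name__ == "__main__":
    # T = 100000 and N_T = 2 are borrowed from the figure captions;
    # A = 5 arms is an assumed value for illustration only.
    print(lower_bound_order(n_arms=5, n_changes=2, horizon=100000))
```

Doubling the number of arms doubles the term, while doubling the horizon only adds $AN_T\log 2$, which reflects the logarithmic dependence on $T$ for a fixed number of changes.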

Figures (5)

Figure 1: The general DAB procedure.
Figure 2: Illustration of the workflow of Procedure 1.
Figure 3: Regret plots versus the time steps for $T=100000$, averaged over $2000$ runs.
Figure 4: The mean reward assignment process of the bandit instances with $N_{T}=2$.
Figure 5: Illustration of the event $\mathcal{G}$.

Theorems & Definitions (8)

Theorem 1
Theorem 2
Theorem 3: modular regret upper bound for DAB procedures
Remark 1
Corollary 1
proof
Corollary 2
Remark 2
