Detection Augmented Bandit Procedures for Piecewise Stationary MABs: A Modular Approach
Yu-Han Huang, Argyrios Gerogiannis, Subhonmesh Bose, Venugopal V. Veeravalli
TL;DR
This work addresses non-stationary multi-armed bandits (MABs) modeled as piecewise stationary environments (PS-MABs). It proposes a modular Detection Augmented Bandit (DAB) framework that couples a stationary bandit algorithm with a change detector and adds forced exploration. The authors establish new instance-dependent and minimax lower bounds on PS-MAB regret and develop a modular regret analysis that separates the detector's contribution from the bandit's, yielding order-optimal regret bounds for a broad class of detector-bandit combinations. Experiments pairing GLR and GSR detectors with UCB, MOSS, and klUCB show competitive or superior regret and reliable detection performance. Overall, the DAB approach offers a plug-and-play methodology for non-stationary bandits with provable guarantees and practical effectiveness; extending it to more general slowly changing settings is left to future work.
Abstract
Conventional Multi-Armed Bandit (MAB) algorithms are designed for stationary environments, where the reward distributions associated with the arms do not change with time. In many applications, however, the environment is more accurately modeled as being non-stationary. In this work, piecewise stationary MAB (PS-MAB) environments are investigated, in which the reward distributions associated with a subset of the arms change at some change-points and remain stationary between change-points. Our focus is on the asymptotic analysis of PS-MABs, for which practical algorithms based on change detection have been previously proposed. Our goal is to modularize the design and analysis of such Detection Augmented Bandit (DAB) procedures. To this end, we first provide novel, improved performance lower bounds for PS-MABs. Then, we identify the requirements for stationary bandit algorithms and change detectors in a DAB procedure that are needed for the modularization. We assume that the rewards are sub-Gaussian. Under this assumption and a condition on the separation of the change-points, we show that the analysis of DAB procedures can indeed be modularized, so that the regret bounds can be obtained in a unified manner for various combinations of change detectors and bandit algorithms. Through this analysis, we develop new modular DAB procedures that are order-optimal. Finally, we showcase the practical effectiveness of our modular DAB approach in our experiments, studying its regret performance compared to other methods and investigating its detection capabilities.
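To make the modular structure concrete, the following is a minimal Python sketch of a detection augmented loop: a stationary bandit, one change detector per arm, a forced exploration schedule, and a full restart on detection. It is an illustration under stated assumptions, not the procedure analyzed in the paper. UCB1 stands in for the stationary bandit, and the detector is a simple split-scan test for a Gaussian mean shift with known sigma rather than the GLR or GSR detectors studied by the authors. The names run_dab, SplitScanDetector, and piecewise_gaussian, and the parameters explore_period and threshold, are hypothetical.

```python
import math
import random


class UCB1:
    """Stand-in stationary bandit (UCB1); any index policy could be swapped in."""

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.reset()

    def reset(self):
        self.counts = [0] * self.n_arms
        self.sums = [0.0] * self.n_arms

    def select(self):
        for a in range(self.n_arms):
            if self.counts[a] == 0:  # pull every arm once before using indices
                return a
        t = sum(self.counts)
        return max(
            range(self.n_arms),
            key=lambda a: self.sums[a] / self.counts[a]
            + math.sqrt(2 * math.log(t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


class SplitScanDetector:
    """Stand-in detector (assumption): scans every split of the samples seen
    since the last reset and compares a Gaussian mean-shift statistic
    (known sigma) against a fixed threshold."""

    def __init__(self, threshold, sigma=1.0):
        self.threshold = threshold
        self.sigma = sigma
        self.reset()

    def reset(self):
        self.prefix = [0.0]  # prefix sums of the observed rewards

    def update(self, x):
        self.prefix.append(self.prefix[-1] + x)
        n = len(self.prefix) - 1
        total = self.prefix[-1]
        for s in range(1, n):
            m1 = self.prefix[s] / s
            m2 = (total - self.prefix[s]) / (n - s)
            stat = s * (n - s) / n * (m1 - m2) ** 2 / (2 * self.sigma ** 2)
            if stat > self.threshold:
                return True
        return False


def run_dab(reward_fn, n_arms, horizon, explore_period=20, threshold=15.0, sigma=1.0):
    """Run the sketch; returns (pulled arms, detection times)."""
    bandit = UCB1(n_arms)
    detectors = [SplitScanDetector(threshold, sigma) for _ in range(n_arms)]
    last_restart = 0
    pulls, detections = [], []
    for t in range(horizon):
        # Forced exploration (assumed schedule): the first n_arms slots of
        # every explore_period steps since the last restart go round-robin.
        pos = (t - last_restart) % explore_period
        arm = pos if pos < n_arms else bandit.select()
        reward = reward_fn(arm, t)
        pulls.append(arm)
        bandit.update(arm, reward)
        if detectors[arm].update(reward):
            # A change is declared: restart the bandit and all detectors.
            detections.append(t)
            bandit.reset()
            for d in detectors:
                d.reset()
            last_restart = t + 1
    return pulls, detections


def piecewise_gaussian(means_by_segment, change_points, sigma=1.0, seed=0):
    """Hypothetical piecewise stationary environment with Gaussian rewards."""
    rng = random.Random(seed)

    def reward_fn(arm, t):
        seg = sum(t >= c for c in change_points)  # index of the current segment
        return rng.gauss(means_by_segment[seg][arm], sigma)

    return reward_fn
```

The bandit and the detector interact only through select, update, and reset, so either component can be replaced without touching the loop. This separation is what a modular design relies on. A call such as run_dab(piecewise_gaussian([[0.9, 0.1], [0.1, 0.9]], [500]), n_arms=2, horizon=2000) would exercise the sketch on a two-arm instance with one change-point. The threshold and exploration period here are illustrative values, not tuned or theoretically justified choices.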
