Batch Ensemble for Variance Dependent Regret in Stochastic Bandits
Asaf Cassel, Orin Levy, Yishay Mansour
TL;DR
This work tackles the exploration–exploitation trade-off in stochastic Multi-Armed Bandits (MAB) by introducing Batch Ensemble, a simple ensemble-inspired method built on a batching-based optimistic mean estimator with a single tunable parameter: the number of batches. The authors prove near-optimal, variance-aware regret bounds for Bernoulli arms and show that the analysis extends to broader distributions (symmetric, bounded, or with lower-bounded variance) without altering the algorithm, including an anytime variant. They also discuss a distributed interpretation and a practical implementation, highlighting low computational overhead and compatibility with parallel deployment. Empirical results on synthetic benchmarks show performance competitive with UCB-based and MARS-style methods, with favorable scalability and robustness to variations in arm scale. The approach offers a parameter-light, efficient alternative for variance-aware exploration that can be adapted to distributed settings and potentially extended to MDPs.
Abstract
Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL). Most works achieve this by carefully estimating the model uncertainty and following the so-called optimistic model. Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that provably achieves near-optimal regret for stochastic Multi-Armed Bandits (MAB). Crucially, our algorithm has just a single parameter, namely the number of batches, and its value does not depend on distributional properties such as the scale and variance of the losses. We complement our theoretical results by demonstrating the effectiveness of our algorithm on synthetic benchmarks.
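To make the idea of a batching-based optimistic estimator concrete, the following is a minimal sketch and not the authors' exact algorithm. It assumes that each arm's observed losses are split round-robin into a fixed number of batches, that the optimistic estimate is the smallest batch mean (smaller is better for losses), and that the learner pulls the arm with the smallest estimate. The function names, the round-robin assignment, and the initial exploration phase are all illustrative assumptions.

```python
import random


def optimistic_loss_estimate(samples, num_batches):
    """Assumed estimator: smallest mean among round-robin batches of the samples."""
    # Sample j is assigned to batch j % num_batches, so assignments never change.
    batches = [samples[i::num_batches] for i in range(num_batches)]
    means = [sum(b) / len(b) for b in batches if b]
    return min(means)


def run_batch_ensemble(loss_probs, horizon, num_batches, seed=0):
    """Simulate Bernoulli-loss arms; returns how often each arm was pulled."""
    rng = random.Random(seed)
    num_arms = len(loss_probs)
    samples = [[] for _ in range(num_arms)]
    pulls = [0] * num_arms
    for t in range(horizon):
        if t < num_arms * num_batches:
            # Assumed warm-up: fill every batch of every arm with one sample.
            arm = t % num_arms
        else:
            # Pull the arm whose optimistic (lowest) batch mean is smallest.
            arm = min(
                range(num_arms),
                key=lambda a: optimistic_loss_estimate(samples[a], num_batches),
            )
        loss = 1.0 if rng.random() < loss_probs[arm] else 0.0
        samples[arm].append(loss)
        pulls[arm] += 1
    return pulls


# Batches are [0.0, 1.0] and [1.0, 1.0]; their means are 0.5 and 1.0.
print(optimistic_loss_estimate([0.0, 1.0, 1.0, 1.0], 2))  # 0.5
print(run_batch_ensemble([0.2, 0.5, 0.8], horizon=300, num_batches=4))
```

In this sketch the only tuning knob is `num_batches`, mirroring the single-parameter claim above; nothing in the code refers to the scale or variance of the losses.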
