Batch Ensemble for Variance Dependent Regret in Stochastic Bandits

Asaf Cassel; Orin Levy; Yishay Mansour

TL;DR

This work tackles the exploration–exploitation trade-off in stochastic multi-armed bandits (MAB) by introducing Batch Ensemble, a simple ensemble-inspired method that uses a batching-based optimistic mean estimator and has a single tunable parameter: the number of batches. The authors prove near-optimal, variance-aware regret bounds for Bernoulli arms and show that the analysis extends to broader distributions (symmetric, bounded, or with lower-bounded variance) without altering the algorithm; an anytime variant is also given. A distributed interpretation and practical implementation are discussed, highlighting low computational overhead and compatibility with parallel deployment. Empirical results on synthetic benchmarks show performance competitive with UCB-based and MARS-style methods, with favorable scalability and robustness to variations in arm scale. The approach offers a parameter-light, efficient alternative for variance-aware exploration that can be adapted to distributed settings and potentially extended to MDPs.

Abstract

Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL). Most works achieve this by carefully estimating the model uncertainty and following the so-called optimistic model. Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that provably achieves near-optimal regret for stochastic Multi-Armed Bandits (MAB). Crucially, our algorithm has just a single parameter, namely the number of batches, and its value does not depend on distributional properties such as the scale and variance of the losses. We complement our theoretical results by demonstrating the effectiveness of our algorithm on synthetic benchmarks.
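To make the idea of a batch-based optimistic estimate concrete, here is a minimal sketch. It is not the authors' algorithm: this page does not state the exact form of the estimator, so the sketch assumes that an arm's observed losses are split into `num_batches` batches and that the smallest batch mean serves as the optimistic (lowest-loss) estimate. The function names, the round-robin batching, and the warm-up rule are all hypothetical choices made for illustration.

```python
import random


def optimistic_mean(samples, num_batches):
    """Assumed estimator: the smallest batch mean of the observed losses."""
    # Deal the samples round-robin into num_batches batches (an assumption).
    batches = [samples[i::num_batches] for i in range(num_batches)]
    # Ignore batches that are still empty.
    means = [sum(b) / len(b) for b in batches if b]
    return min(means)


def batch_ensemble_sketch(arm_means, horizon, num_batches, seed=0):
    """Play Bernoulli-loss arms, always choosing the lowest optimistic estimate."""
    rng = random.Random(seed)
    num_arms = len(arm_means)
    history = [[] for _ in range(num_arms)]
    pulls = [0] * num_arms
    for t in range(horizon):
        if t < num_arms * num_batches:
            # Warm-up (assumption): give every batch of every arm one sample.
            arm = t % num_arms
        else:
            arm = min(
                range(num_arms),
                key=lambda a: optimistic_mean(history[a], num_batches),
            )
        # Bernoulli loss: 1 with probability arm_means[arm], else 0.
        loss = 1.0 if rng.random() < arm_means[arm] else 0.0
        history[arm].append(loss)
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(batch_ensemble_sketch([0.001, 0.15, 0.2, 0.25, 0.3], 2000, 4))
```

The only knob in this sketch is `num_batches`, mirroring the single parameter described in the abstract. Recomputing every batch mean each round is wasteful; running sums per batch would avoid that, but the simple version keeps the idea visible.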

Paper Structure (22 sections, 9 theorems, 45 equations, 5 figures, 1 algorithm)

Introduction
Our Contributions.
Related work.
Preliminaries
Problem setup.
Deviation bounds.
Algorithm and Main Results
An Optimistic Mean Estimator
The Batch Ensemble Algorithm
Dependence on the true arm distributions.
A distributed view.
An anytime expected regret algorithm.
Beyond Bernoulli arms
Bernoulli-fication.
Scaled Bernoulli.
...and 7 more sections

Key Result

Lemma 1

Let $\bar{\mu} = \frac{1}{n} \sum_{n'=1}^{n} X_{n'}$. With probability at least $1 - \delta$

Figures (5)

Figure 1: Results for $5$ Bernoulli arms with a clear, low-variance best arm. The arm expectations tested are $0.001, 0.15, 0.2, 0.25, 0.3$.
Figure 2: Results for $10$ Bernoulli arms with means $0.9,0.91,0.92,\ldots,0.99$.
Figure 3: Results for $10$ Bernoulli arms with random means.
Figure 4: Results for $10$ Gaussian arms with random means and variance $1$.
Figure 5: $10$ Exponential arms with random scales.

Theorems & Definitions (14)

Lemma 1
Lemma 2: wiklund2023another, Corollary 1
Lemma 3
Proof
Lemma 4
Proof
Theorem 5
Theorem 6
Lemma 7
Proof
...and 4 more
