A Framework for Fair Evaluation of Variance-Aware Bandit Algorithms

Elise Wolf

TL;DR

This work addresses the reproducible evaluation of variance-aware bandit algorithms by proposing a standardized framework, the Bandit Playground, that compares eight MAB algorithms (classical and variance-aware) under controlled conditions. Using three Bernoulli-arm scenarios with horizon $T=10^6$ and 100 trials each, the study shows that variance-aware methods (e.g., UCB-Tuned, EUCBV) are more robust and perform better in high-uncertainty, small-gap settings, whereas well-tuned classical approaches (e.g., ETC with a carefully chosen $m$, or UCB variants) perform best in easier, highly separable environments. The contributions are a transparent, extensible evaluation framework and empirical insights into the conditions under which variance-aware strategies outperform their classical counterparts, which can inform algorithm selection in risk-sensitive applications.

Abstract

Multi-armed bandit (MAB) problems serve as a fundamental building block for more complex reinforcement learning algorithms. However, evaluating and comparing MAB algorithms remains challenging due to the lack of standardized conditions and replicability. This is particularly problematic for variance-aware extensions of classical methods like UCB, whose performance can heavily depend on the underlying environment. In this study, we address how performance differences between bandit algorithms can be reliably observed, and under what conditions variance-aware algorithms outperform classical ones. We present a reproducible evaluation designed to systematically compare eight classical and variance-aware MAB algorithms. The evaluation framework, implemented in our Bandit Playground codebase, features clearly defined experimental setups, multiple performance metrics (reward, regret, reward distribution, value-at-risk, and action optimality), and an interactive evaluation interface that supports consistent and transparent analysis. We show that variance-aware algorithms can offer advantages in settings with high uncertainty where the difficulty arises from subtle differences between arm rewards. In contrast, classical algorithms often perform equally well or better in more separable scenarios or if fine-tuned extensively. Our contributions are twofold: (1) a framework for systematic evaluation of MAB algorithms, and (2) insights into the conditions under which variance-aware approaches outperform their classical counterparts.
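The kind of evaluation described above can be illustrated with a minimal sketch. It is not the Bandit Playground API: the function names `run_ucb1` and `evaluate` are hypothetical, classical UCB1 stands in for the eight algorithms, the horizon is far shorter than $T=10^6$, and an upper quantile of regret is used as a simple stand-in for a tail-risk metric such as value-at-risk.

```python
import math
import random


def run_ucb1(means, horizon, rng):
    """Play UCB1 on Bernoulli arms; return (pseudo-regret, fraction of optimal pulls)."""
    k = len(means)
    best = max(means)
    counts = [0] * k    # number of pulls per arm
    values = [0.0] * k  # sample-average reward estimate per arm
    regret = 0.0
    optimal_pulls = 0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            # Choose the arm with the largest upper confidence bound.
            arm = max(
                range(k),
                key=lambda a: values[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        regret += best - means[arm]  # pseudo-regret uses the true means
        if means[arm] == best:
            optimal_pulls += 1
    return regret, optimal_pulls / horizon


def evaluate(means, horizon=2000, trials=20, seed=0):
    """Repeat independent trials with a fixed seed and summarise the metrics."""
    rng = random.Random(seed)
    results = [run_ucb1(means, horizon, rng) for _ in range(trials)]
    regrets = sorted(r for r, _ in results)
    mean_regret = sum(regrets) / trials
    mean_optimality = sum(o for _, o in results) / trials
    # Empirical 95% quantile of regret (assumed stand-in for a tail-risk metric).
    tail_regret = regrets[min(trials - 1, math.ceil(0.95 * trials) - 1)]
    return mean_regret, mean_optimality, tail_regret


if __name__ == "__main__":
    # A small-gap scenario: two Bernoulli arms with means 0.50 and 0.55.
    print(evaluate([0.50, 0.55]))
```

Fixing the seed and reporting several metrics over repeated trials is what makes runs comparable across algorithms. Swapping in another policy only requires replacing the arm-selection rule inside the loop.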

Paper Structure

This paper contains 27 sections, 2 equations, 5 tables.

Theorems & Definitions (2)

  • Definition 2.1: Regret [lattimore]
  • Definition 2.2: Estimated Action Value [sutton]
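
The two definitions are only listed by name here. For orientation, the standard textbook forms are sketched below; the paper's own notation may differ. The symbols are assumptions: $X_t$ is the reward at step $t$, $A_t$ the arm pulled, $\mu_a$ the mean reward of arm $a$, and $T$ the horizon.

```latex
% Regret of a policy \pi over horizon T on arms with means \mu_1, \dots, \mu_K
R_T(\pi) = T\,\mu^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} X_t\right],
\qquad \mu^{*} = \max_{a} \mu_a

% Sample-average estimate of the value of arm a before step t
% (defined once arm a has been pulled at least once)
Q_t(a) = \frac{\sum_{i=1}^{t-1} X_i \,\mathbb{1}\{A_i = a\}}
              {\sum_{i=1}^{t-1} \mathbb{1}\{A_i = a\}}
```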