HAVER: Instance-Dependent Error Bounds for Maximum Mean Estimation and Applications to Q-Learning and Monte Carlo Tree Search

Tuan Ngo Nguyen, Jay Barrett, Kwang-Sung Jun

TL;DR

This work tackles the problem of estimating the maximum mean μ1 across K distributions from passive samples, a task central to reinforcement learning and search-based planning. It introduces HAVER (Head AVERaging), a new estimator, and proves an instance-dependent MSE bound that matches the oracle rate and surpasses it in regimes where many arms are near-optimal. The paper also derives corollaries for the equal-sample, all-arms-equal, and near-top-arm scenarios, showing rates as fast as 1/(KN) in some cases. Experiments on multi-armed bandits, Q-learning, and MCTS show HAVER outperforming established estimators, supporting its practical value for improving sample efficiency in RL and tree search.

Abstract

We study the problem of estimating the value of the largest mean among K distributions via samples from them (rather than estimating which distribution has the largest mean), which arises in various machine learning tasks including Q-learning and Monte Carlo Tree Search (MCTS). While a few algorithms have been proposed, their performance analyses have been limited to their biases rather than a precise error metric. In this paper, we propose a novel algorithm called HAVER (Head AVERaging) and analyze its mean squared error. Our analysis reveals that HAVER has compelling performance in two respects. First, HAVER estimates the maximum mean as well as the oracle who knows the identity of the best distribution and reports its sample mean. Second, perhaps surprisingly, HAVER exhibits even better rates than this oracle when there are many distributions near the best one. Both of these improvements are the first of their kind in the literature, and we also prove that the naive algorithm that reports the largest empirical mean does not achieve these bounds. Finally, we confirm our theoretical findings via numerical experiments where we implement HAVER in bandit, Q-learning, and MCTS algorithms. In these experiments, HAVER consistently outperforms the baseline methods, demonstrating its effectiveness across different applications.
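To make the two reference points in the abstract concrete, the following minimal sketch compares the naive largest-empirical-mean (LEM) estimator with the oracle that reports the sample mean of the known best distribution. It does not implement HAVER, whose definition is not given in this summary. The Gaussian rewards with unit variance, the equal sample count per arm, the all-arms-equal instance, and the function names (`lem_estimate`, `oracle_estimate`, `simulate_mse`) are all assumptions made for illustration.

```python
import random


def lem_estimate(samples_per_arm):
    """Largest empirical mean (LEM): report the max of the per-arm sample means."""
    return max(sum(s) / len(s) for s in samples_per_arm)


def oracle_estimate(samples_per_arm, best_arm):
    """Oracle: knows which arm is best and reports that arm's sample mean."""
    s = samples_per_arm[best_arm]
    return sum(s) / len(s)


def simulate_mse(means, n_samples, n_trials, seed=0):
    """Monte Carlo estimate of the MSE of LEM and the oracle against max(means).

    Assumption: each arm yields unit-variance Gaussian samples and every arm
    receives the same number of samples.
    """
    rng = random.Random(seed)
    true_max = max(means)
    best_arm = means.index(true_max)
    lem_sq_err = 0.0
    oracle_sq_err = 0.0
    for _ in range(n_trials):
        samples = [[rng.gauss(mu, 1.0) for _ in range(n_samples)] for mu in means]
        lem_sq_err += (lem_estimate(samples) - true_max) ** 2
        oracle_sq_err += (oracle_estimate(samples, best_arm) - true_max) ** 2
    return lem_sq_err / n_trials, oracle_sq_err / n_trials


# Hypothetical instance: K = 10 arms that all share the same mean.
lem_mse, oracle_mse = simulate_mse(means=[0.0] * 10, n_samples=20, n_trials=2000)
print(f"LEM MSE: {lem_mse:.4f}  oracle MSE: {oracle_mse:.4f}")
```

In this instance the oracle's MSE should sit near 1/N = 0.05, while LEM is pulled upward by taking the maximum over noisy sample means and so incurs a larger error. This is the kind of gap an estimator that averages over near-optimal arms aims to close.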

Paper Structure

This paper contains 27 sections, 43 theorems, 317 equations, 6 figures, 2 algorithms.

Key Result

Theorem 4

LEM achieves

Figures (6)

  • Figure 1: Uniform sampling instance. The results are averaged over 1000 trials.
  • Figure 2: Q-learning in the regular grid world environment. The results are averaged over 1000 trials. The optimal mean reward per step is the black line.
  • Figure 3: Q-learning in the inflated grid world, where the number of actions at each state is duplicated to 4. The results are averaged over 1000 trials. The optimal mean reward per step is the black line.
  • Figure 4: MCTS applied to the FrozenLake environments (top: 4x4 environment, bottom: 8x8 environment). The results are averaged over 500 trials. The optimal total reward is the black line.
  • Figure 5: $K^{*}$-best instance. The results are averaged over 1000 trials.
  • ...and 1 more figure

Theorems & Definitions (84)

  • Definition 1: Sub-Gaussian distribution
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Corollary 8
  • Corollary 9
  • Corollary 10
  • Corollary 11
  • proof
  • Corollary 12
  • ...and 74 more