Inference with the Upper Confidence Bound Algorithm

Koulik Khamaru, Cun-Hui Zhang

TL;DR

The paper studies statistical inference on data collected by adaptive bandit procedures, focusing on the Upper Confidence Bound (UCB) algorithm. It introduces a stability property for the number of arm pulls, inspired by Lai and Wei (1982), and proves that UCB satisfies it when the number of arms is fixed. As a consequence, the sample mean of each arm is asymptotically normal and confidence intervals built from it are asymptotically valid. The paper further shows that stability extends to a number of arms $K$ that grows with the number of arm pulls $T$, provided $\frac{\log K}{\log T} \rightarrow 0$ and the number of near-optimal arms is large. The results combine a martingale central limit theorem with Lai and Wei's stability theory, and they cover confidence interval construction and variance consistency. Overall, the work gives a principled route to asymptotically exact inference in sequential decision-making and shows how stability can be used in adaptive experiments.

Abstract

In this paper, we discuss the asymptotic behavior of the Upper Confidence Bound (UCB) algorithm in the context of multi-armed bandit problems and its implications for downstream inferential tasks. While inferential tasks become challenging when data is collected sequentially, we argue that this problem can be alleviated when the sequential algorithm at hand satisfies a certain stability property. This notion of stability is motivated by the seminal work of Lai and Wei (1982). Our first main result shows that this stability property is always satisfied by the UCB algorithm, and as a result the sample means for each arm are asymptotically normal. Next, we examine the stability properties of the UCB algorithm when the number of arms $K$ is allowed to grow with the number of arm pulls $T$. We show that in this case the arms are stable when $\frac{\log K}{\log T} \rightarrow 0$ and the number of near-optimal arms is large.
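
The inference pipeline described in the abstract can be sketched in a few lines of Python: run a UCB-style rule, then form a normal-approximation interval for each arm from its sample mean and pull count. This is a minimal sketch under stated assumptions, not the paper's Algorithm: the exploration bonus $\sqrt{2 \log T / n_a}$, the unit-variance Gaussian rewards, the known noise level `sigma`, and the names `run_ucb` and `wald_interval` are all illustrative choices.

```python
import math
import random


def run_ucb(means, horizon, seed=0):
    """Run a UCB-style rule on unit-variance Gaussian arms (assumed reward model).

    Returns the per-arm pull counts and per-arm reward sums after `horizon` pulls.
    """
    rng = random.Random(seed)
    k = len(means)
    log_t = math.log(horizon)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once so each index is defined
        else:
            # Index = sample mean + exploration bonus (assumed form of the bonus).
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a] + math.sqrt(2.0 * log_t / counts[a]),
            )
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums


def wald_interval(count, total, sigma=1.0, z=1.96):
    """Nominal 95% interval: sample mean +/- z * sigma / sqrt(count).

    `sigma` is treated as known here, which is an assumption of this sketch.
    """
    mean = total / count
    half_width = z * sigma / math.sqrt(count)
    return mean - half_width, mean + half_width


# Hypothetical three-armed instance.
counts, sums = run_ucb(means=[0.5, 0.3, 0.0], horizon=5000, seed=1)
intervals = [wald_interval(n, s) for n, s in zip(counts, sums)]
for arm, (n, (lo, hi)) in enumerate(zip(counts, intervals)):
    print(f"arm {arm}: pulls={n}, nominal 95% CI=({lo:.3f}, {hi:.3f})")
```

The interval uses the random pull count as if it were a fixed sample size. Whether that is justified for adaptively collected data is exactly what the stability property is meant to settle.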

Paper Structure

This paper contains 30 sections, 5 theorems, 75 equations, 1 figure, 1 algorithm.

Key Result

Theorem 3.1

Suppose we pull bandit arms using Algorithm algo:UCB. Let Assumptions eqn:arm-mean-diff-UB to eqn:sub-Gaussian be in force, and let the number of arms $K$ be fixed. Then, for each arm $a \in [K]$, the number of arm pulls $n_{a, T}$ satisfies where $n^\star \equiv n^\star(T, \{\Delta_a\}_{a \in [K]})$ is the unique solution to the following equation Here, $\Delta_a = \mu_1 - \mu_a$ and without loss of gener

Figures (1)

  • Figure 1: Distribution of $\sqrt{N_{2}}\,(\hat{\mu}_2 - \mu_2)$, the normalized error of the sample mean for arm 2. The distribution deviates from the standard normal distribution when the $\epsilon$-greedy algorithm is used (left panel) and is in good agreement with it when the UCB algorithm is used (right panel). The results are based on $5000$ repetitions. See Section sec:A-closer-look for a detailed discussion. The code for the plot is available at https://github.com/KoulikBerkeley/UCB-stability-plots/tree/main.
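
A small Monte Carlo experiment in the spirit of the figure's UCB panel can be sketched as follows. It repeats a UCB-style run many times, records the statistic $\sqrt{N_a}\,(\hat{\mu}_a - \mu_a)$ for one arm, and reports how often it falls inside $\pm 1.96$. This is not the code from the linked repository: the two-armed instance with equal means, the bonus $\sqrt{2 \log T / n_a}$, the unit-variance Gaussian rewards, the reduced repetition count, and the function names are all assumptions made for illustration.

```python
import math
import random


def ucb_pulls(means, horizon, rng):
    """One UCB-style run on unit-variance Gaussian arms; returns (counts, sums)."""
    k = len(means)
    log_t = math.log(horizon)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # initial round-robin pass over the arms
        else:
            # Sample mean plus an assumed exploration bonus.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a] + math.sqrt(2.0 * log_t / counts[a]),
            )
        counts[arm] += 1
        sums[arm] += rng.gauss(means[arm], 1.0)
    return counts, sums


def normalized_errors(means, arm, horizon, reps, seed=0):
    """Collect sqrt(N_arm) * (sample mean - true mean) over independent runs."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        counts, sums = ucb_pulls(means, horizon, rng)
        mu_hat = sums[arm] / counts[arm]
        stats.append(math.sqrt(counts[arm]) * (mu_hat - means[arm]))
    return stats


# Hypothetical instance; arm index 1 is the second arm.
stats = normalized_errors(means=[0.5, 0.5], arm=1, horizon=1000, reps=200, seed=0)
coverage = sum(abs(s) <= 1.96 for s in stats) / len(stats)
print(f"share of runs with |statistic| <= 1.96: {coverage:.3f}")
```

Comparing a histogram of `stats` with the standard normal density, or replacing `ucb_pulls` with an $\epsilon$-greedy rule, gives a rough analogue of the two panels. More repetitions and longer horizons make the comparison sharper.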

Theorems & Definitions (6)

  • Definition 2.1
  • Theorem 3.1
  • Theorem 3.2
  • Corollary 1
  • Theorem 4.1
  • Lemma 5.1