Precise Asymptotics and Refined Regret of Variance-Aware UCB

Yingying Fan, Yuxuan Han, Jinchi Lv, Xiaocong Xu, Zhengyuan Zhou

TL;DR

We provide an asymptotic characterization of the arm-pulling rates for UCB-V and establish that UCB-V can achieve a more refined regret bound, previously unknown even for more complicated and advanced variance-aware online decision-making algorithms.

Abstract

In this paper, we study the behavior of the Upper Confidence Bound-Variance (UCB-V) algorithm for Multi-Armed Bandit (MAB) problems. UCB-V is a variant of the canonical Upper Confidence Bound (UCB) algorithm that incorporates variance estimates into its decision-making process. More precisely, we provide an asymptotic characterization of the arm-pulling rates for UCB-V, extending recent results for the canonical UCB in Kalvit and Zeevi (2021) and Khamaru and Zhang (2024). In an interesting contrast to the canonical UCB, our analysis reveals that the behavior of UCB-V can exhibit instability, meaning that the arm-pulling rates may not always be asymptotically deterministic. Besides the asymptotic characterization, we also provide non-asymptotic bounds for the arm-pulling rates in the high-probability regime, offering insights into the regret analysis. As an application of this high-probability result, we establish that UCB-V can achieve a more refined regret bound, previously unknown even for more complicated and advanced variance-aware online decision-making algorithms.
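To make the phrase "incorporates variance estimates into its decision-making process" concrete, the following is a minimal sketch of a UCB-V style policy that returns the arm-pulling counts $n_{a,T}$. It is not the paper's exact algorithm. It assumes one common form of the index (empirical mean, plus a bonus scaled by the empirical variance, plus a correction scaled by a range parameter). The function name `ucb_v`, the parameters `zeta` and `b`, their default values, and the Gaussian reward model are all illustrative assumptions.

```python
import math
import random


def ucb_v(means, sigmas, horizon, zeta=1.2, b=1.0, seed=0):
    """Run a UCB-V style policy and return the number of pulls of each arm.

    Assumptions (not taken from the paper): Gaussian rewards, exploration
    constant `zeta`, and `b` used as a reward-scale parameter in the bonus.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k       # number of pulls per arm
    sums = [0.0] * k       # running sum of rewards per arm
    sq_sums = [0.0] * k    # running sum of squared rewards per arm

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull every arm once
        else:
            explore = zeta * math.log(t)
            best_index, arm = -math.inf, 0
            for a in range(k):
                n = counts[a]
                mean = sums[a] / n
                # Empirical variance, clipped at 0 against rounding error.
                var = max(sq_sums[a] / n - mean * mean, 0.0)
                # Variance-aware bonus plus a 1/n correction term.
                index = mean + math.sqrt(2.0 * var * explore / n) + 3.0 * b * explore / n
                if index > best_index:
                    best_index, arm = index, a

        reward = rng.gauss(means[arm], sigmas[arm])
        counts[arm] += 1
        sums[arm] += reward
        sq_sums[arm] += reward * reward

    return counts


if __name__ == "__main__":
    # Two-armed instance; list position 0 plays the role of the optimal arm 1.
    print(ucb_v(means=[0.5, 0.0], sigmas=[0.1, 0.1], horizon=2000))
```

Running the policy over many seeds and collecting the first entry of the returned list gives an empirical distribution of the optimal arm's pull count, which is the kind of quantity the arm-pulling-rate results describe.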

Paper Structure

This paper contains 32 sections, 22 theorems, 204 equations, 4 figures, 1 algorithm.

Key Result

Proposition 1

Recall that $\overline{\sigma}_a \equiv \overline{\sigma}_a(\rho;T)$ and $\overline{\Delta}_a \equiv \overline{\Delta}_a(\rho;T)$ for $a \in [2]$. Fix any $\delta > 0$, $\rho > 1$, and any positive integer $T \geq 4$ such that Then, with probability at least $1-\delta$, we have and

Figures (4)

  • Figure 1: The distributions of $n_{1,T}$ (optimal arm-pulling count) for UCB-V and UCB with $T = 50,000$ over $5000$ repetitions.
  • Figure 2: (a): The regrets of UCB-V with $\sigma_2 \asymp T^{-1/4},\Delta_2 \asymp 1/\sqrt{T}$ fixed and $\sigma_1 \asymp T^{-1/2},T^{-1/4},1$, each instance with $10$ repetitions. (b): The median and $30\%$ quantile of $n_{1,T}$ (optimal arm-pulling count) for UCB and UCB-V, under varying $\Lambda_T$ in the $\sigma_1 = \mathfrak{o}(\sigma_2)$ regime, with $T = 1,000,000$ over $30$ repetitions for each $\Delta_2$. The red dotted line is the predicted $n_{1,T}$ of UCB-V using \eqref{eq:fpe_star}.
  • Figure 3: (a): The confidence region of $n_{1,T}$ under different $\Lambda_T$, with $\sigma_1 = \mathfrak{o}(\sigma_2)$. The dotted lines represent the exact and perturbed solutions of \eqref{eq:fpe}, where the perturbed curves solve $f(\varphi) = 1\pm 1/\log T$. The UCB-V line shows the number of arm pulls under the UCB-V algorithm with $30\%$ quantile, with $T = 10^5$ over $30$ repetitions. (b): The ratio between the perturbed solution $f(\varphi) = 1 \pm 1/\log T$ and the exact solution $f(\varphi) = 1$ is shown for $T = 10^5, 10^7, 10^9$ under different $\Lambda_T$. It can be seen that the ratio deviates from 1 as $\Lambda_T \to 1$ with increasing $T$.
  • Figure 4: The empirical distributions of the $Z$-statistic for the sub-optimal arm for UCB and UCB-V with $\sigma_1 = 0, \sigma_2 = 1/4$ under different $\Lambda_T$, both with $T = 50,000$ over $2,000$ repetitions.

Theorems & Definitions (26)

  • Proposition 1
  • Lemma 2
  • Theorem 3
  • Example 1
  • Example 2
  • Definition 4
  • Proposition 5
  • Proposition 6
  • Theorem 7
  • Theorem 8
  • ...and 16 more